Abstract

We describe the application of tools from statistical mechanics to analyse the dynamics 
of various classes of supervised learning rules in perceptrons. The character of this paper 
is mostly that of a cross between a biased non-encyclopedic review and lecture notes: we 
try to present a coherent and self-contained picture of the basics of this field, to explain 
the ideas and tricks, to show how the predictions of the theory compare with (simulation) 
experiments, and to bring together scattered results. Technical details are given explicitly 
in an appendix. In order to avoid distraction we concentrate the references in a final 
section. In addition this paper contains some new results: (i) explicit solutions of the 
macroscopic equations that describe the error evolution for on-line and batch learning 
rules, (ii) an analysis of the dynamics of arbitrary macroscopic observables (for complete
and incomplete training sets), leading to a general Fokker-Planck equation, and (iii) the
macroscopic laws describing batch learning with complete training sets. We close the
paper with a preliminary exposé of ongoing research on the dynamics of learning for
the case where the training set is incomplete (i.e. where the number of examples scales 
linearly with the network size). 



(to be published in 'Statistics and Computing') 
1 Introduction



1.1 Supervised Learning in Neural Networks 

In this paper we study the dynamics of supervised learning in artificial neural networks. The 
basic scenario is as follows. A 'student' neural network executes a certain known operation 
$S: D \to R$, which is parametrised by a vector $\mathbf{J}$, usually representing synaptic weights
and/or neuronal thresholds. Here $D$ denotes the set of all possible 'questions' and $R$ denotes
the set of all possible 'answers'. The student is being trained to emulate a given 'teacher',
which executes some as yet unknown operation $T: D \to R$. In order to achieve the objective
the student network $S$ tries to gradually improve its performance by adapting its parameters
$\mathbf{J}$ according to an iterative procedure, using only examples of input vectors (or 'questions')
$\boldsymbol{\xi} \in \Re^N$ which are drawn at random from a fixed training set $\tilde{D} \subseteq D$ of size $|\tilde{D}|$, and
the corresponding values of the teacher outputs $T(\boldsymbol{\xi})$ (the 'correct answers'). The iterative
procedure (the 'learning rule') is not allowed to involve any further knowledge of the operation 
T. As far as the student is concerned the teacher is an 'oracle', or 'black box'; the only
information available about the inner workings of the black box is contained in the various
answers $T(\boldsymbol{\xi})$ it provides. See figure 1. For simplicity we will assume each 'question' $\boldsymbol{\xi}$ to be
equally likely to occur (generalization of what follows to the case where the questions $\boldsymbol{\xi}$ carry
non-uniform probabilities or probability densities $p(\boldsymbol{\xi})$ is straightforward).

Figure 1: The general scenario of supervised learning: a 'student network' S is being 'trained'
to perform an operation $T: D \to R$ by updating its control parameters $\mathbf{J}$ according to an
iterative procedure, the 'learning rule'. This rule is allowed to make use only of examples of
'question/answer pairs' $(\boldsymbol{\xi}, T(\boldsymbol{\xi}))$, where $\boldsymbol{\xi} \in \tilde{D} \subseteq D$. The actual 'teacher operation' T that
generated the answers $T(\boldsymbol{\xi})$, on the other hand, cannot be observed directly. The goal is to
arrive at a situation where $S(\boldsymbol{\xi}) = T(\boldsymbol{\xi})$ for all $\boldsymbol{\xi} \in D$.

We will consider the following two classes of learning rules, i.e. of recipes for the iterative 
modification of the student's control parameters J, which we will refer to as on-line learning 
rules and batch learning rules, respectively:



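$$
\begin{array}{ll}
\text{On-Line:} & \mathbf{J}(t+1) = \mathbf{J}(t) + F[\boldsymbol{\xi}(t), \mathbf{J}(t), T(\boldsymbol{\xi}(t))] \\[1mm]
\text{Batch:} & \mathbf{J}(t+1) = \mathbf{J}(t) + \langle F[\boldsymbol{\xi}, \mathbf{J}(t), T(\boldsymbol{\xi})] \rangle_{\tilde{D}}
\end{array}
\tag{1}
$$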



The integer variable $t = 0, 1, 2, 3, \ldots$ labels the iteration steps. In the case of on-line learning
an input vector $\boldsymbol{\xi}(t)$ is drawn independently at each iteration step from the training set $\tilde{D}$,
followed by a modification of the control parameters $\mathbf{J}$. Therefore this process is stochastic
(Markovian). In the case of batch learning the modification that would have been made in
the on-line version is averaged over the input vectors in the training set $\tilde{D}$, at each iteration
step. This process is therefore a deterministic iterative map^1. Both rules in (1) can formally
be written in the general form of a Markovian stochastic process. We introduce the probability
density $p_t(\mathbf{J})$ to find parameter vector $\mathbf{J}$ at discrete iteration step $t$. In terms of this
microscopic probability density the processes (1) can be written as:



$$
p_{t+1}(\mathbf{J}) = \int\! d\mathbf{J}'\; W[\mathbf{J}; \mathbf{J}']\, p_t(\mathbf{J}')
\tag{2}
$$



with the transition probability densities 



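$$
\begin{array}{ll}
\text{On-Line:} & W[\mathbf{J}; \mathbf{J}'] = \langle \delta[\mathbf{J} - \mathbf{J}' - F[\boldsymbol{\xi}, \mathbf{J}', T(\boldsymbol{\xi})]] \rangle_{\tilde{D}} \\[1mm]
\text{Batch:} & W[\mathbf{J}; \mathbf{J}'] = \delta[\mathbf{J} - \mathbf{J}' - \langle F[\boldsymbol{\xi}, \mathbf{J}', T(\boldsymbol{\xi})] \rangle_{\tilde{D}}]
\end{array}
\tag{3}
$$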



(in which $\delta[z]$ denotes the delta-distribution). The advantage of using the on-line version of
the learning rule is a reduction in the amount of calculations that have to be done at each 
iteration step; the price paid for this reduction is the presence of fluctuations, with as yet 
unknown impact on the performance of the system. 
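
The difference between the two update schemes in (1) can be sketched in a few lines of Python. This is a minimal illustration only: the function names (`online_step`, `batch_step`, `hebbian_F`), the Hebbian choice made for F and the toy teacher are assumptions introduced here, not part of the formalism above.

```python
import random

def online_step(J, training_set, teacher, F, rng):
    """One on-line step: draw a single example at random and apply F to it."""
    xi = rng.choice(training_set)
    dJ = F(xi, J, teacher(xi))
    return [j + d for j, d in zip(J, dJ)]

def batch_step(J, training_set, teacher, F):
    """One batch step: average the on-line modification over the training set."""
    total = [0.0] * len(J)
    for xi in training_set:
        dJ = F(xi, J, teacher(xi))
        for i, d in enumerate(dJ):
            total[i] += d
    return [j + s / len(training_set) for j, s in zip(J, total)]

def hebbian_F(xi, J, answer, eta=1.0):
    """Hypothetical choice of F: the Hebbian modification eta * xi * T(xi) / N."""
    n = len(xi)
    return [eta * x * answer / n for x in xi]

# Toy demonstration with an illustrative teacher T(xi) = sgn(B . xi).
rng = random.Random(0)
B = [1.0, -2.0, 0.5, 3.0, -1.0]
teacher = lambda xi: 1 if sum(b * x for b, x in zip(B, xi)) > 0 else -1
training_set = [[rng.choice((-1, 1)) for _ in range(5)] for _ in range(8)]
J0 = [0.0] * 5
print("on-line:", online_step(J0, training_set, teacher, hebbian_F, rng))
print("batch:  ", batch_step(J0, training_set, teacher, hebbian_F))
```

Repeated calls to `online_step` give a stochastic trajectory, whereas `batch_step` returns the same result for the same input, in line with the distinction between a Markovian process and a deterministic map.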

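We will denote averages over the probability density $p_t(\mathbf{J})$, averages over the full set $D$
of possible input vectors and averages over the training set $\tilde{D}$ in the following way:

$$
\langle g(\mathbf{J}) \rangle = \int\! d\mathbf{J}\; p_t(\mathbf{J})\, g(\mathbf{J})
\qquad
\langle K(\boldsymbol{\xi}) \rangle_{D} = \frac{1}{|D|} \sum_{\boldsymbol{\xi} \in D} K(\boldsymbol{\xi})
\qquad
\langle K(\boldsymbol{\xi}) \rangle_{\tilde{D}} = \frac{1}{|\tilde{D}|} \sum_{\boldsymbol{\xi} \in \tilde{D}} K(\boldsymbol{\xi})
$$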



The average $\langle K(\boldsymbol{\xi}) \rangle_{\tilde{D}}$ will in general depend on the microscopic realisation of the training
set $\tilde{D}$. To quantify the goal and the progress of the student one finally defines an error
$E[T(\boldsymbol{\xi}), S(\boldsymbol{\xi})] = E[T(\boldsymbol{\xi}), f[\boldsymbol{\xi}; \mathbf{J}]]$, which measures the mismatch between student answers
and correct (teacher) answers for individual questions. The two key quantities of interest in
supervised learning are the (time-dependent) averages of this error measure, calculated over
the training set $\tilde{D}$ and the full question set $D$, respectively:



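$$
\begin{array}{ll}
\text{Training Error:} & E_t(\mathbf{J}) = \langle E[T(\boldsymbol{\xi}), f[\boldsymbol{\xi}; \mathbf{J}]] \rangle_{\tilde{D}} \\[1mm]
\text{Generalization Error:} & E_g(\mathbf{J}) = \langle E[T(\boldsymbol{\xi}), f[\boldsymbol{\xi}; \mathbf{J}]] \rangle_{D}
\end{array}
\tag{4}
$$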



^1 Clearly one could define an infinite number of intermediate classes of learning rules (e.g. learning with
'momentum'); the present two are just the extreme cases. Note also that the term 'batch' unfortunately means
different things to different scientists. The definition used here is sometimes described as 'off-line'.
These quantities are stochastic observables, since they are functions of the stochastically
evolving vector $\mathbf{J}$. Their expectation values over the stochastic process (2) are given by



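$$
\begin{array}{ll}
\text{Mean Training Error:} & \langle E_t \rangle = \langle \langle E[T(\boldsymbol{\xi}), f[\boldsymbol{\xi}; \mathbf{J}]] \rangle_{\tilde{D}} \rangle \\[1mm]
\text{Mean Generalization Error:} & \langle E_g \rangle = \langle \langle E[T(\boldsymbol{\xi}), f[\boldsymbol{\xi}; \mathbf{J}]] \rangle_{D} \rangle
\end{array}
\tag{5}
$$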



Note that the prefix 'mean' refers to the stochasticity in the vector $\mathbf{J}$; both $\langle E_t \rangle$ and $\langle E_g \rangle$
will in general still depend on the realisation of the training set $\tilde{D}$.

The training error measures the performance of the student on the questions it could have 
been confronted with during the learning stage (in the case of on-line learning the student 
need not have seen all of them). The generalization error measures the student's performance 
on the full question set and its minimisation is therefore the main target of the process. The 
quality of a theory describing the dynamics of supervised learning can be measured by the 
degree to which it succeeds in predicting the values of $\langle E_t \rangle$ and $\langle E_g \rangle$ as a function of the
iteration time $t$ and for arbitrary choices made for the function $F[\ldots]$ that determines the
details of the learning rules (1).

1.2 Statistical Mechanics and Its Applicability 

Statistical mechanics deals with large systems of stochastically interacting microscopic ele- 
ments (particles, magnets, polymers, etc.). The general strategy of statistical mechanics is to 
abandon any ambition to solve models of such systems at the microscopic level of individual 
elements, but to use the microscopic laws to calculate laws describing the behaviour of a suitably
chosen set of macroscopic observables. The toolbox of statistical mechanics consists of
various methods and tricks to perform this reduction from the microscopic to a macroscopic 
level, which are based on clever ways to do the bookkeeping of probabilities. The experience 
and intuition that has been built up over the last century tells us what to expect (e.g. phase 
transitions), and serves as a guide in choosing the macroscopic observables and in seeing the 
difference between relevant mathematical subtleties and irrelevant ones. As in any statistical 
theory, clean and transparent mathematical laws can be expected to emerge only for large 
(preferably infinitely large) systems. 

Supervised learning processes as described in the previous subsection appear to meet the 
criteria for statistical mechanics to apply, provided we are happy to restrict ourselves to large 
systems ($N \to \infty$). Here the microscopic stochastic dynamical variables are the components
of the vector J, and one is as little interested in knowing all individual components of J as 
one would be in knowing all position coordinates of the molecules in a bucket of water. We are 
rather after the generalization and training errors, which are indeed macroscopic observables. 

Further support for the applicability of statistical mechanics is provided by numerical 
simulations. Consider, for instance, the example of the ordinary perceptron learning rule. For 
simplicity we choose $\tilde{D} = D = \{-1, 1\}^N$, with a task T generated by a 'teacher perceptron',
corresponding to the following choices in the language of the previous subsection: 



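$$
S(\boldsymbol{\xi}) = \mathrm{sgn}(\mathbf{J} \cdot \boldsymbol{\xi})
\qquad
T(\boldsymbol{\xi}) = \mathrm{sgn}(\mathbf{B} \cdot \boldsymbol{\xi})
$$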



Figure 2: Evolution in time of the observable $\omega = \mathbf{J} \cdot \mathbf{B}/|\mathbf{J}|$ during numerical simulations
of the standard perceptron learning rule (with a randomly drawn normalised teacher weight
vector $\mathbf{B}$), following random initialisations of the student weight vector $\mathbf{J}$ (axes: $\omega$ versus time $t$).



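with $\mathbf{J}, \mathbf{B} \in \Re^N$. The teacher weight vector $\mathbf{B}$ is chosen at random, and normalised according
to $|\mathbf{B}| = 1$. The (on-line) perceptron learning rule is

$$
\mathbf{J}(t+1) = \mathbf{J}(t) + \boldsymbol{\xi}(t)\, \mathrm{sgn}(\mathbf{B} \cdot \boldsymbol{\xi}(t))\; \theta[-(\mathbf{B} \cdot \boldsymbol{\xi}(t))(\mathbf{J}(t) \cdot \boldsymbol{\xi}(t))]
$$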



with the step function $\theta[z > 0] = 1$, $\theta[z < 0] = 0$. An educated guess for a possibly relevant
macroscopic observable is the object that also played a central role in the original perceptron
convergence proof: $\omega(t) = \mathbf{J}(t) \cdot \mathbf{B}/|\mathbf{J}(t)|$. The result of measuring the value of $\omega(t)$ during
the execution of the above learning rule is shown in figure 2. These experiments clearly
suggest (keeping in mind that for specifically constructed pathological teacher vectors the
picture might be different) that, if viewed on the relevant $N$-dependent time-scale (as in the
figure), the fluctuations in $\omega$ become negligible as $N \to \infty$, and a clean deterministic law
emerges. This is the type of situation we need in order to use statistical mechanics, and
finding an analytical expression for this deterministic law will be our goal.
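
A simulation of this kind can be sketched as follows. The sketch assumes the standard error-correcting form of the perceptron rule (update by $\boldsymbol{\xi}\,\mathrm{sgn}(\mathbf{B}\cdot\boldsymbol{\xi})$ whenever student and teacher disagree) and Gaussian random initialisation; the function name `simulate_perceptron` and all parameter values are illustrative assumptions, not the settings used for figure 2.

```python
import math
import random

def simulate_perceptron(N, steps, seed=0):
    """Run the on-line perceptron rule; return histories of R = J.B and omega = J.B/|J|."""
    rng = random.Random(seed)
    # Teacher vector B: random direction, normalised to |B| = 1.
    B = [rng.gauss(0.0, 1.0) for _ in range(N)]
    norm_B = math.sqrt(sum(b * b for b in B))
    B = [b / norm_B for b in B]
    # Student vector J: random initialisation (assumed Gaussian here).
    J = [rng.gauss(0.0, 1.0) for _ in range(N)]
    R_history, omega_history = [], []
    for _ in range(steps):
        xi = [rng.choice((-1, 1)) for _ in range(N)]
        teacher_field = sum(b * x for b, x in zip(B, xi))
        student_field = sum(j * x for j, x in zip(J, xi))
        # Update only when student and teacher disagree: theta[-(B.xi)(J.xi)] = 1.
        if teacher_field * student_field < 0:
            s = 1.0 if teacher_field > 0 else -1.0
            J = [j + s * x for j, x in zip(J, xi)]
        R = sum(j * b for j, b in zip(J, B))
        R_history.append(R)
        omega_history.append(R / math.sqrt(sum(j * j for j in J)))
    return R_history, omega_history

R_hist, omega_hist = simulate_perceptron(N=50, steps=500)
print("omega at first and last step:", omega_hist[0], omega_hist[-1])
```

Each update adds $|\mathbf{B}\cdot\boldsymbol{\xi}| \geq 0$ to the overlap $\mathbf{J}\cdot\mathbf{B}$, so this overlap can never decrease along a run; repeating the run for increasing N and plotting $\omega$ against the rescaled time is one way to observe the shrinking fluctuations described above.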

As a second example we will choose a two-layer network, trained according to the error 
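back-propagation rule. Here the microscopic stochastic variables are both the weights $\{W_{ij}\}$
from the input to the hidden layer (of L neurons) and the weights $\{J_i\}$ from the hidden layer
to the output layer. We define $\tilde{D} = D = \{-1, 1\}^K$ and

$$
S(\boldsymbol{\xi}) = \tanh\Big[ \sum_{i=1}^{L} J_i\, y_i(\boldsymbol{\xi}) \Big]
\qquad
y_i(\boldsymbol{\xi}) = \tanh\Big[ \sum_{j=1}^{K} W_{ij}\, \xi_j \Big]
$$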



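We consider two types of tasks, a linearly separable one (which is always learnable by the
present student), and the parity operation (which is learnable only for $L \geq K$). For $\boldsymbol{\xi} \in \{-1, 1\}^K$:

$$
\begin{array}{ll}
\text{task I:} & T(\boldsymbol{\xi}) = \mathrm{sgn}(\mathbf{B} \cdot \boldsymbol{\xi}) \in \{-1, 1\} \\[1mm]
\text{task II:} & T(\boldsymbol{\xi}) = \prod_{i=1}^{K} \xi_i \in \{-1, 1\}
\end{array}
$$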
Figure 3: Evolution of the overall error $E$ in a two-layer feed-forward network, trained by
error backpropagation (with K = 15 input neurons, L = 10 hidden neurons, and a single
output neuron). The results refer to independent experiments involving either a linearly
separable task (with random teacher vector, lower curves) or the parity operation (upper
curves), following random initialisation.



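Our macroscopic observable will be the mean error (since $\tilde{D} = D$ the generalization error
and the training error are here identical):

$$
E = \langle E[T(\boldsymbol{\xi}), S(\boldsymbol{\xi})] \rangle_{D},
\qquad
E[T(\boldsymbol{\xi}), S(\boldsymbol{\xi})] = \frac{1}{2} [T(\boldsymbol{\xi}) - S(\boldsymbol{\xi})]^2
$$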



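Perfect performance would correspond to $E = 0$. On the other hand, a trivial perceptron
with zero weights throughout would give $S(\boldsymbol{\xi}) = 0$, so $E = \frac{1}{2} \langle T^2(\boldsymbol{\xi}) \rangle_D = \frac{1}{2}$. The learning rule
used is the discretised on-line version (with learning rate $\epsilon$) of the error backpropagation rule:

$$
J_i(t+\epsilon) = J_i(t) - \epsilon \frac{\partial}{\partial J_i} E[T(\boldsymbol{\xi}(t)), S(\boldsymbol{\xi}(t))]
$$

$$
W_{ij}(t+\epsilon) = W_{ij}(t) - \epsilon \frac{\partial}{\partial W_{ij}} E[T(\boldsymbol{\xi}(t)), S(\boldsymbol{\xi}(t))]
$$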



The results of doing several such simulations, for K = 15 and L = 10 (so that the parity 
operation is an unlearnable task for the student network) and following random initialisation 
of the various weights, are shown in figure 3. These experiments again clearly suggest that also
for multi-layer networks statistical mechanics will be a natural tool to analyse the dynamics 
of learning. Provided we scale our parameters appropriately and take a suitable limit (there 
will be different equivalent ways of doing this) the fluctuations in suitably chosen macroscopic 
observables can be made to vanish, such that transparent deterministic laws emerge. 



1.3 A Preview 

There are two main classes of situations in the supervised learning arena, which differ funda- 
mentally in their dynamics and in the degree to which we can analyse them mathematically. 
The first class is the one where the training set D is what we call 'complete': sufficiently 
large and sufficiently diverse to lead to a learning dynamics which in the limit $N \to \infty$ is
identical to that of the situation where $\tilde{D} = D$. For example: in single perceptrons and in
multi-layer perceptrons with a finite number of hidden nodes one finds, for the case where
$D = \{-1, 1\}^N$ and where the members of the training set $\tilde{D}$ are drawn at random from $D$,
that completeness of the training set amounts to $\lim_{N \to \infty} N/|\tilde{D}| = 0$. This makes sense: it
means that for $N \to \infty$ there will be an infinite number of training examples per degree of
freedom. For this class of models it is fair to say that the dynamics of learning can be fully
analysed in a reasonably simple way^2. Because this situation is now so well under control
and admits analytical solutions, it is a good area to describe in a self-contained way in
a paper such as the present one. We will restrict ourselves to single perceptrons with various
types of learning rules, since they form the most transparent playground for explaining
how the mathematical techniques work. For multi-layer perceptrons with a finite number of
hidden neurons and complete training sets the procedure to be followed is very similar^3.

The picture changes dramatically if we move away from complete training sets and con- 
sider those where the number of training examples is proportional to the number of degrees 
of freedom, i.e. in simple perceptrons and in two-layer perceptrons with a finite number of 
hidden neurons this implies $|\tilde{D}| = \alpha N$ ($0 < \alpha < \infty$). Now the dependence of the microscopic
variables $\mathbf{J}$ on the realisation of the training set $\tilde{D}$ is non-negligible. However, if the questions
in the training set are drawn at random from the full question set $D$ one often finds that
in the $N \to \infty$ limit the values of the macroscopic observables only depend on the size $|\tilde{D}|$
of the training set, not on its microscopic realisation. For those familiar with the statistical
mechanical analysis of the operation of recurrent neural networks: learning dynamics with
complete training sets is mathematically similar to the dynamics of attractor networks away
from saturation, whereas learning dynamics with incomplete training sets is similar, if not
equivalent, to the dynamics of attractor networks close to saturation (in turn equivalent to
the complex dynamics of spin-glasses). Here one needs much more powerful mathematical 
tools, which are as yet only partly available. This class of problems is therefore only begin- 
ning to be studied, and we cannot yet give a well rounded overview with a happy ending (as 
for the case of complete training sets). We will do the next best thing and try to explain as 
clearly as possible what the problem is. 

No review is unbiased and complete; and one always has to strike a balance between 
broadness and depth (equivalently: between being encyclopedic and being self-contained). 
Here we have opted for the latter. As a result, the references we give are intended to serve 
as a guide only, not as a true reflection of all the work that has been done; for each paper 
mentioned at least fifty will have been left out, and we wish to apologise beforehand to the 
authors of the papers in the latter category. We aim to explain the ideas and techniques only 
for a subset of the field, in the hope that the text can then be sufficiently self-contained to 
serve not just the interested spectator but also those who wish to become actively involved. 



^2 In these models one can still study interesting new phenomena, such as the effects of having noisy teachers
etc., but at least the route to be followed to solve them is well-defined and guaranteed to work.

^3 The situation is different if we try to deal with multi-layer perceptrons with a number of hidden neurons
which scales linearly with the number of input channels N. As far as we are aware, this still poses an unsolved
problem, even in the case of complete training sets.
2 On-Line Learning: Complete Training Sets and Explicit Rules



We will now derive explicitly macroscopic dynamical equations that describe the evolution in 
time for the error in large perceptrons, trained with several on-line learning rules to perform 
linearly separable tasks. In this section we restrict ourselves to complete training sets $\tilde{D} =
D = \{-1, 1\}^N$. There is consequently no difference between training and generalization error,
and we can simply define $E = \lim_{N \to \infty} \langle E_g \rangle = \lim_{N \to \infty} \langle E_t \rangle$.

2.1 General On-Line Learning Rules 

Consider a linearly separable binary classification task $T: \{-1, 1\}^N \to \{-1, 1\}$. It can be
regarded as generated by a teacher perceptron with some unknown weight vector $\mathbf{B} \in \Re^N$,
i.e. $T(\boldsymbol{\xi}) = \mathrm{sgn}(\mathbf{B} \cdot \boldsymbol{\xi})$, normalised according to $|\mathbf{B}| = 1$ (with the sign function $\mathrm{sgn}(z > 0) = 1$,
$\mathrm{sgn}(z < 0) = -1$). A student perceptron with output $S(\boldsymbol{\xi}) = \mathrm{sgn}(\mathbf{J} \cdot \boldsymbol{\xi})$ (where
$\mathbf{J} \in \Re^N$) is being trained in an on-line fashion using randomly drawn examples of input
vectors $\boldsymbol{\xi} \in \{-1, 1\}^N$ with corresponding teacher answers $T(\boldsymbol{\xi})$. The general picture of figure
1 thus specialises to figure 4. We exploit our knowledge of the perceptron's scaling properties
(see figure 2), and distinguish between the discrete time unit in terms of iteration steps, from
now on to be denoted by $\mu = 1, 2, 3, \ldots$, and the scale-invariant time unit $t_\mu = \mu/N$. Our
goal is to derive well-behaved differential equations in the limit $N \to \infty$, so we require weight
changes occurring in intervals $\Delta t = N^{-1}$ to be of order $O(N^{-1})$ as well. In terms of equation (1)
this implies that $F[\ldots] = O(N^{-1})$. If, finally, we restrict ourselves to those rules where weight
changes are made in the direction of the example vectors (which includes most popular rules),
we obtain the generic^4 recipe

$$
\mathbf{J}(t_\mu + \tfrac{1}{N}) = \mathbf{J}(t_\mu) + \frac{1}{N}\, \eta(t_\mu)\, \boldsymbol{\xi}^\mu\, \mathrm{sgn}(\mathbf{B} \cdot \boldsymbol{\xi}^\mu)\; \mathcal{F}[|\mathbf{J}(t_\mu)|; \mathbf{J}(t_\mu) \cdot \boldsymbol{\xi}^\mu, \mathbf{B} \cdot \boldsymbol{\xi}^\mu]
\tag{6}
$$

Figure 4: A student perceptron S is being trained according to on-line learning rules to
perform a linearly separable operation, generated by some unknown teacher perceptron T.

^4 One can obviously write down more general rules, and also write the present recipe (6) in different ways.
Here $\eta(t_\mu)$ denotes a (possibly time-dependent) learning rate and $\boldsymbol{\xi}^\mu$ is the input vector
selected at iteration step $\mu$. $\mathcal{F}[\ldots]$ is an as yet arbitrary function of the length of the student
weight vector and of the local fields $u$ and $v$ of student and teacher (note: $\mathcal{F}$ can depend on
the sign of the teacher field only, not on its magnitude). For example, for $\mathcal{F}[J; u, v] = 1$ we
obtain a Hebbian rule, for $\mathcal{F}[J; u, v] = \theta[-uv]$ we obtain the perceptron learning rule, etc.

We now try to solve the dynamics of the learning process in terms of the two macroscopic 
observables that play a special role in the perceptron convergence proof: 

$$
Q[\mathbf{J}] = \mathbf{J}^2 \qquad R[\mathbf{J}] = \mathbf{J} \cdot \mathbf{B}
\tag{7}
$$

(at this stage the selection of observables is still no more than intuition-driven guesswork). 
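The formal approach would now be to derive an expression for the (time-dependent) probability
density $P(Q, R) = \langle \delta[Q - Q[\mathbf{J}]]\, \delta[R - R[\mathbf{J}]] \rangle$. However, it turns out that in the present
case^5 there is a short-cut. Squaring (6) and taking the inner product of (6) with the teacher
vector $\mathbf{B}$ gives, respectively

$$
Q[\mathbf{J}(t_\mu + \tfrac{1}{N})] = Q[\mathbf{J}(t_\mu)] + \frac{2}{N}\, \eta(t_\mu)\, (\mathbf{J}(t_\mu) \cdot \boldsymbol{\xi}^\mu)\, \mathrm{sgn}(\mathbf{B} \cdot \boldsymbol{\xi}^\mu)\, \mathcal{F}[|\mathbf{J}(t_\mu)|; \mathbf{J}(t_\mu) \cdot \boldsymbol{\xi}^\mu, \mathbf{B} \cdot \boldsymbol{\xi}^\mu]
+ \frac{1}{N}\, \eta^2(t_\mu)\, \mathcal{F}^2[|\mathbf{J}(t_\mu)|; \mathbf{J}(t_\mu) \cdot \boldsymbol{\xi}^\mu, \mathbf{B} \cdot \boldsymbol{\xi}^\mu]
$$

$$
R[\mathbf{J}(t_\mu + \tfrac{1}{N})] = R[\mathbf{J}(t_\mu)] + \frac{1}{N}\, \eta(t_\mu)\, |\mathbf{B} \cdot \boldsymbol{\xi}^\mu|\, \mathcal{F}[|\mathbf{J}(t_\mu)|; \mathbf{J}(t_\mu) \cdot \boldsymbol{\xi}^\mu, \mathbf{B} \cdot \boldsymbol{\xi}^\mu]
$$

(note: $\boldsymbol{\xi}^\mu \cdot \boldsymbol{\xi}^\mu = N$). After $\ell$ discrete update steps we will have accumulated $\ell$ such modifications,
and will thus arrive at:

$$
\frac{Q[\mathbf{J}(t_\mu + \ell/N)] - Q[\mathbf{J}(t_\mu)]}{\ell/N}
= \frac{1}{\ell} \sum_{m=0}^{\ell-1} \Big\{ 2 \eta(t_\mu + \tfrac{m}{N})\, (\mathbf{J}(t_\mu + \tfrac{m}{N}) \cdot \boldsymbol{\xi}^{\mu+m})\, \mathrm{sgn}(\mathbf{B} \cdot \boldsymbol{\xi}^{\mu+m})\, \mathcal{F}[|\mathbf{J}(t_\mu + \tfrac{m}{N})|; \mathbf{J}(t_\mu + \tfrac{m}{N}) \cdot \boldsymbol{\xi}^{\mu+m}, \mathbf{B} \cdot \boldsymbol{\xi}^{\mu+m}]
+ \eta^2(t_\mu + \tfrac{m}{N})\, \mathcal{F}^2[|\mathbf{J}(t_\mu + \tfrac{m}{N})|; \mathbf{J}(t_\mu + \tfrac{m}{N}) \cdot \boldsymbol{\xi}^{\mu+m}, \mathbf{B} \cdot \boldsymbol{\xi}^{\mu+m}] \Big\}
$$

$$
\frac{R[\mathbf{J}(t_\mu + \ell/N)] - R[\mathbf{J}(t_\mu)]}{\ell/N}
= \frac{1}{\ell} \sum_{m=0}^{\ell-1} \eta(t_\mu + \tfrac{m}{N})\, |\mathbf{B} \cdot \boldsymbol{\xi}^{\mu+m}|\, \mathcal{F}[|\mathbf{J}(t_\mu + \tfrac{m}{N})|; \mathbf{J}(t_\mu + \tfrac{m}{N}) \cdot \boldsymbol{\xi}^{\mu+m}, \mathbf{B} \cdot \boldsymbol{\xi}^{\mu+m}]
$$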

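All is still exact, but at this stage we will have to make an assumption which is not entirely
satisfactory^6. We assume that $\mathbf{J}(t_\mu + \tfrac{m}{N}) \cdot \boldsymbol{\xi}^{\mu+m} \to \mathbf{J}(t_\mu) \cdot \boldsymbol{\xi}^{\mu+m}$ if $N \to \infty$ for finite $m$. This is
only true in a probabilistic sense, since, although $J_i(t_\mu + \tfrac{m}{N}) = J_i(t_\mu) + O(\tfrac{m}{N})$, the inner product
is a sum of N terms. If for now, however, we accept this step and also choose learning rates
which vary sufficiently slowly over time to guarantee existence of the limit $\lim_{N \to \infty} \eta(t_\mu)$,
we find that by taking the limit $N \to \infty$, followed by the limit $\ell \to \infty$, three pleasant
simplifications occur: (i) the time unit $t_\mu = \mu/N$ becomes a continuous variable, (ii) the
left-hand sides of the above equations for the evolution of the observables Q and R become
temporal derivatives, and (iii) the summations in the right-hand sides of these equations
become averages over the training set. Upon putting $Q(t) = Q[\mathbf{J}(t)]$ and $R(t) = R[\mathbf{J}(t)]$ the
result can be written as:

$$
\frac{d}{dt} Q(t) = 2 \eta(t)\, \langle (\mathbf{J}(t) \cdot \boldsymbol{\xi})\, \mathrm{sgn}(\mathbf{B} \cdot \boldsymbol{\xi})\, \mathcal{F}[Q^{\frac{1}{2}}(t); \mathbf{J}(t) \cdot \boldsymbol{\xi}, \mathbf{B} \cdot \boldsymbol{\xi}] \rangle_D
+ \eta^2(t)\, \langle \mathcal{F}^2[Q^{\frac{1}{2}}(t); \mathbf{J}(t) \cdot \boldsymbol{\xi}, \mathbf{B} \cdot \boldsymbol{\xi}] \rangle_D
$$

$$
\frac{d}{dt} R(t) = \eta(t)\, \langle |\mathbf{B} \cdot \boldsymbol{\xi}|\, \mathcal{F}[Q^{\frac{1}{2}}(t); \mathbf{J}(t) \cdot \boldsymbol{\xi}, \mathbf{B} \cdot \boldsymbol{\xi}] \rangle_D
$$

^5 This will be different in the case of incomplete training sets.

^6 We will later find out that a more careful analysis gives the same results.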

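The only dependence of the right-hand sides of these expressions on the microscopic variables
$\mathbf{J}$ is via the student fields $\mathbf{J}(t) \cdot \boldsymbol{\xi} = Q^{\frac{1}{2}}(t)\, \hat{\mathbf{J}}(t) \cdot \boldsymbol{\xi}$, with $\hat{\mathbf{J}} = \mathbf{J}/|\mathbf{J}|$^7. We therefore define the
stochastic variables $x = \hat{\mathbf{J}} \cdot \boldsymbol{\xi}$ and $y = \mathbf{B} \cdot \boldsymbol{\xi}$, and their joint probability distribution $P_t(x, y)$:

$$
P_t(x, y) = \langle \delta[x - \hat{\mathbf{J}}(t) \cdot \boldsymbol{\xi}]\, \delta[y - \mathbf{B} \cdot \boldsymbol{\xi}] \rangle_D
\qquad
\langle f(x, y) \rangle = \int\! dx\, dy\; P_t(x, y)\, f(x, y)
\tag{8}
$$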

Using brackets without subscripts for joint field averages cannot cause confusion, since such 
expressions always replace averages over J, rather than occur simultaneously. Our previous 
result now takes the form 

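$$
\frac{d}{dt} Q(t) = 2 \eta(t)\, Q^{\frac{1}{2}}(t)\, \langle x\, \mathrm{sgn}(y)\, \mathcal{F}[Q^{\frac{1}{2}}(t); Q^{\frac{1}{2}}(t) x, y] \rangle + \eta^2(t)\, \langle \mathcal{F}^2[Q^{\frac{1}{2}}(t); Q^{\frac{1}{2}}(t) x, y] \rangle
\tag{9}
$$

$$
\frac{d}{dt} R(t) = \eta(t)\, \langle |y|\, \mathcal{F}[Q^{\frac{1}{2}}(t); Q^{\frac{1}{2}}(t) x, y] \rangle
\tag{10}
$$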

Since the operation performed by the student does not depend on the length | J| of its weight 
vector, and since both Q and R involve \J\, it will be convenient at this stage to switch to 
another (equivalent) pair of observables: 

$$
J(t) = |\mathbf{J}(t)| \qquad \omega(t) = \mathbf{B} \cdot \hat{\mathbf{J}}(t)
\tag{11}
$$

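Using the relations $\frac{d}{dt} Q = 2 J \frac{d}{dt} J$ and $\frac{d}{dt} R = J \frac{d}{dt} \omega + \omega \frac{d}{dt} J$, and upon dropping the various
explicit time arguments (for notational convenience) we then find the compact expressions

$$
\frac{d}{dt} J = \eta\, \langle x\, \mathrm{sgn}(y)\, \mathcal{F}[J; Jx, y] \rangle + \frac{\eta^2}{2J}\, \langle \mathcal{F}^2[J; Jx, y] \rangle
\tag{12}
$$

$$
\frac{d}{dt} \omega = \frac{\eta}{J}\, \langle [|y| - \omega x\, \mathrm{sgn}(y)]\, \mathcal{F}[J; Jx, y] \rangle - \frac{\omega \eta^2}{2J^2}\, \langle \mathcal{F}^2[J; Jx, y] \rangle
\tag{13}
$$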

Unless we manage to express $P(x, y)$ in terms of the pair $(J, \omega)$, however, the equations
(12,13) do not constitute a solution of our problem, since we would still be forced to solve
the original microscopic dynamical equations in order to find $P(x, y)$ as a function of time and
work out (12,13).



The final stage of the argument is to assume that the joint probability distribution (8)
has a Gaussian shape, since $D = \{-1, 1\}^N$ and since all $\boldsymbol{\xi} \in D$ contribute equally to the
average in (8). This will be true in the vast majority of cases, e.g. it is true with probability
one if the vectors $\mathbf{J}$ and $\mathbf{B}$ are drawn at random from compact sets like $[-1, 1]^N$, due to the
central limit theorem^8. Gaussian distributions are fully specified by their first and second
order moments, which are here calculated trivially using $\langle \xi_i \rangle_D = 0$ and $\langle \xi_i \xi_j \rangle_D = \delta_{ij}$:

$$
\langle x \rangle = \sum_i \hat{J}_i \langle \xi_i \rangle_D = 0
\qquad
\langle y \rangle = \sum_i B_i \langle \xi_i \rangle_D = 0
$$

$$
\langle x^2 \rangle = \sum_{ij} \hat{J}_i \hat{J}_j \langle \xi_i \xi_j \rangle_D = 1
\qquad
\langle y^2 \rangle = \sum_{ij} B_i B_j \langle \xi_i \xi_j \rangle_D = 1
\qquad
\langle xy \rangle = \sum_{ij} \hat{J}_i B_j \langle \xi_i \xi_j \rangle_D = \omega
$$

giving

$$
P(x, y) = \frac{1}{2\pi \sqrt{1 - \omega^2}}\; e^{-\frac{1}{2} [x^2 + y^2 - 2xy\omega]/(1 - \omega^2)}
\tag{14}
$$

^7 This property of course depends crucially on our choice (6) made for the form of the learning rules.

Note that $P(x, y) = P(y, x)$. The simple fact that $P(x, y)$ depends on time only through
$\omega$ ensures that the two equations (12,13) are a closed set. Note also that now (12,13) are
deterministic equations; apparently the fluctuations in the macroscopic observables $Q[\mathbf{J}]$ and
$R[\mathbf{J}]$ vanish in the $N \to \infty$ limit.

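Finally, the generalization error $E_g$ (here identical to the training error $E_t$ due to $\tilde{D} = D$)
can be expressed in terms of our macroscopic observables. We define the error made in a
single classification of an input $\boldsymbol{\xi}$ as $E[T(\boldsymbol{\xi}), S(\boldsymbol{\xi})] = \theta[-(\mathbf{B} \cdot \boldsymbol{\xi})(\mathbf{J} \cdot \boldsymbol{\xi})] \in \{0, 1\}$. Averaged over
$D$ this gives the probability of a misclassification for randomly drawn questions $\boldsymbol{\xi} \in D$:

$$
\lim_{N \to \infty} E_g(\mathbf{J}(t)) = \lim_{N \to \infty} \langle \theta[-(\mathbf{B} \cdot \boldsymbol{\xi})(\mathbf{J}(t) \cdot \boldsymbol{\xi})] \rangle_D = \langle \theta[-xy] \rangle
= \int_0^\infty\!\! \int_0^\infty\! dx\, dy\; [P(x, -y) + P(-x, y)]
$$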

The generalization error (from this stage onwards to be denoted simply by E) also evolves
deterministically for $N \to \infty$, and can be expressed purely in terms of the observable $\omega$.
The integral (with the distribution (14)) can even be done analytically (see appendix) and
produces the simple result

$$
E = \frac{1}{\pi} \arccos(\omega)
\tag{15}
$$

The macroscopic equations (12,13) can now equivalently be written in terms of the pair $(J, E)$.
We have hereby achieved our goal: we have derived a closed set of deterministic equations
for a small number (two) of macroscopic observables, valid for $N \to \infty$, and we know the
generalization error at any time.
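
The relation between the overlap $\omega$ and the misclassification probability can be checked numerically by sampling the Gaussian fields directly. The sketch below is only an illustration: the function names and the sample size are assumptions, and the fields $(x, y)$ are generated as unit-variance Gaussians with correlation $\langle xy \rangle = \omega$.

```python
import math
import random

def generalization_error(omega):
    """Equation (15): E = arccos(omega) / pi."""
    return math.acos(omega) / math.pi

def monte_carlo_error(omega, samples=100000, seed=0):
    """Estimate <theta[-xy]> for unit-variance Gaussian fields with <xy> = omega."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - omega * omega)
    wrong = 0
    for _ in range(samples):
        y = rng.gauss(0.0, 1.0)
        x = omega * y + s * rng.gauss(0.0, 1.0)  # gives <x^2> = 1 and <xy> = omega
        if x * y < 0:  # student and teacher disagree
            wrong += 1
    return wrong / samples

for omega in (0.0, 0.5, 0.9):
    print(omega, generalization_error(omega), monte_carlo_error(omega))
```

The two columns printed for each $\omega$ should agree up to sampling noise of order one over the square root of the number of samples.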

2.2 Hebbian Learning with Constant Learning Rate 



We will now work out our general result (12,13,14) for specific members of the general class
(6) of on-line learning rules. The simplest non-trivial choice to be made is the Hebbian rule,
obtained by choosing $\mathcal{F}[J; Jx, y] = 1$, with a constant learning rate $\eta$:

$$
\mathbf{J}(t_\mu + \tfrac{1}{N}) = \mathbf{J}(t_\mu) + \frac{\eta}{N}\, \boldsymbol{\xi}^\mu\, \mathrm{sgn}(\mathbf{B} \cdot \boldsymbol{\xi}^\mu)
\tag{16}
$$

^8 It is not true for all choices of $\mathbf{J}$ and $\mathbf{B}$. A trivial counter-example is $J_k = \delta_{k1}$, less trivial counter-examples
are e.g. $J_k = e^{-k}$ and $J_k = k^{-\gamma}$ with $\gamma > \frac{1}{2}$.
Figure 5: Flow in the $(E, J)$ plane generated by the Hebbian learning rule with constant
learning rate $\eta$, in the limit $N \to \infty$. Dashed: the line where $dE/dt = 0$ (note that $dJ/dt > 0$ for any
$(E, J)$). Note that the flow asymptotically gives $E \to 0$ and $J \to \infty$.



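Equations (12) and (13), describing the macroscopic dynamics generated by (16) in the limit
$N \to \infty$, now become

$$
\frac{d}{dt} J = \eta\, \langle x\, \mathrm{sgn}(y) \rangle + \frac{\eta^2}{2J}
$$

$$
\frac{d}{dt} \omega = \frac{\eta}{J}\, \langle |y| - \omega x\, \mathrm{sgn}(y) \rangle - \frac{\omega \eta^2}{2J^2}
$$

or, in more explicit form with the function $P(x, y)$ of (14):

$$
\frac{d}{dt} J = \eta \int\!\!\int\! dx\, dy\; x\, \mathrm{sgn}(y)\, P(x, y) + \frac{\eta^2}{2J}
$$

$$
\frac{d}{dt} \omega = \frac{\eta}{J} \int\!\!\int\! dx\, dy\; |y|\, P(x, y) - \frac{\omega \eta}{J} \int\!\!\int\! dx\, dy\; x\, \mathrm{sgn}(y)\, P(x, y) - \frac{\omega \eta^2}{2J^2}
$$

The integrals in these equations can be calculated analytically (see appendix) and we get

$$
\frac{d}{dt} J = \omega \eta \sqrt{\frac{2}{\pi}} + \frac{\eta^2}{2J}
$$

$$
\frac{d}{dt} \omega = \frac{\eta}{J}\, (1 - \omega^2) \sqrt{\frac{2}{\pi}} - \frac{\omega \eta^2}{2J^2}
$$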



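Thus, upon elimination of the observable $\omega$ using equation (15), we arrive at the following
closed differential equations in terms of $J$ and $E$:

$$
\frac{d}{dt} J = \eta \cos(\pi E) \sqrt{\frac{2}{\pi}} + \frac{\eta^2}{2J}
\tag{17}
$$

$$
\frac{d}{dt} E = -\frac{\eta \sin(\pi E)}{\pi J} \sqrt{\frac{2}{\pi}} + \frac{\eta^2}{2\pi J^2 \tan(\pi E)}
\tag{18}
$$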
The flow in the $(E, J)$ plane described by these equations is drawn in figure 5 (which is
obtained by numerical solution of (17,18)). From (17) it follows that $\frac{d}{dt} J > 0$ $\forall t \geq 0$. From
(18) it follows that $\frac{d}{dt} E = 0$ along the line

$$
J_c(E) = \frac{\eta \cos(\pi E)}{2 \sin^2(\pi E)} \sqrt{\frac{\pi}{2}}
$$

(drawn as a dashed line in figure 5).

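Let us now investigate the temporal properties of the solution of (17,18), and work out
their predictions for the asymptotic decay of the generalization error. For small values of E
equations (17,18) yield

$$
\frac{d}{dt} J = \eta \sqrt{\frac{2}{\pi}} + \frac{\eta^2}{2J} + \mathcal{O}(E^2)
\tag{19}
$$

$$
\frac{d}{dt} E = -\frac{\eta E}{J} \sqrt{\frac{2}{\pi}} + \frac{\eta^2}{2\pi^2 J^2 E} + \mathcal{O}(E^3/J,\, E/J^2)
\tag{20}
$$

From (19) we infer that $J \sim \eta t \sqrt{2/\pi}$ for $t \to \infty$. Substitution of this asymptotic solution into
equation (20) gives

$$
\frac{d}{dt} E = -\frac{E}{t} + \frac{1}{4\pi E t^2}
\tag{21}
$$

We insert the ansatz $E = A t^{-\alpha}$ into equation (21) and get the solution $A = 1/\sqrt{2\pi}$, $\alpha = 1/2$.
This implies that (in the $N \to \infty$ limit) on-line Hebbian learning with complete training sets
produces an asymptotic decay of the generalization error of the form

$$
E \sim \frac{1}{\sqrt{2\pi t}} \qquad (t \to \infty)
\tag{22}
$$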

Figures presented later will show the theoretical results of this section together with the results of
doing numerical simulations of the learning rule (16) and with similar results for other on-line
learning rules with constant learning rates. The agreement between theory and simulations
is quite convincing.
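
The asymptotic law (22) can also be checked against a direct numerical solution of the flow equations (17,18). The following sketch uses a plain Euler scheme; the initial condition (J = 1, E = 1/2), the step size and the function name `hebbian_flow` are illustrative assumptions, not choices made in the text.

```python
import math

def hebbian_flow(J0=1.0, E0=0.5, eta=1.0, t_max=500.0, dt=0.01):
    """Euler integration of equations (17,18); returns the final (t, J, E)."""
    a = math.sqrt(2.0 / math.pi)
    J, E, t = J0, E0, 0.0
    for _ in range(int(round(t_max / dt))):
        c, s = math.cos(math.pi * E), math.sin(math.pi * E)
        # Equation (17): dJ/dt = eta cos(pi E) sqrt(2/pi) + eta^2 / (2J)
        dJ = eta * c * a + eta ** 2 / (2.0 * J)
        # Equation (18), with 1/tan(pi E) written as cos/sin
        dE = -eta * s * a / (math.pi * J) + eta ** 2 * c / (2.0 * math.pi * J ** 2 * s)
        J += dt * dJ
        E += dt * dE
        t += dt
    return t, J, E

t, J, E = hebbian_flow()
print("E * sqrt(2 pi t) at the end of the run:", E * math.sqrt(2.0 * math.pi * t))
```

If the asymptotic analysis holds, the printed product should approach one as `t_max` is increased.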

2.3 Perceptron Learning with Constant Learning Rate 

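Our second application of (12,13,14) is making the choice $\mathcal{F}[J; Jx, y] = \theta[-xy]$ in equation
(6), with constant learning rate $\eta$, which produces the perceptron learning algorithm:

$$
\mathbf{J}(t_\mu + \tfrac{1}{N}) = \mathbf{J}(t_\mu) + \frac{\eta}{N}\, \boldsymbol{\xi}^\mu\, \mathrm{sgn}(\mathbf{B} \cdot \boldsymbol{\xi}^\mu)\; \theta[-(\mathbf{B} \cdot \boldsymbol{\xi}^\mu)(\mathbf{J}(t_\mu) \cdot \boldsymbol{\xi}^\mu)]
\tag{23}
$$

In other words: the student weights are updated in accordance with the Hebbian rule only
when $\mathrm{sgn}(\mathbf{B} \cdot \boldsymbol{\xi}) = -\mathrm{sgn}(\mathbf{J} \cdot \boldsymbol{\xi})$, i.e. when student and teacher are not in agreement. Equations
(12,13) now become

$$
\frac{d}{dt} J = \eta\, \langle x\, \mathrm{sgn}(y)\, \theta[-xy] \rangle + \frac{\eta^2}{2J}\, \langle \theta[-xy] \rangle
= \eta \int\!\!\int\! dx\, dy\; x\, \mathrm{sgn}(y)\, \theta[-xy]\, P(x, y) + \frac{\eta^2}{2J} \int\!\!\int\! dx\, dy\; \theta[-xy]\, P(x, y)
$$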
Figure 6: Flow in the $(E, J)$ plane generated by the perceptron learning rule with constant
learning rate $\eta$, in the limit $N \to \infty$. Dashed: the two lines where $dE/dt = 0$ and $dJ/dt = 0$,
respectively. Note that the flow is attracted into the gully between these two dashed lines
and asymptotically gives $E \to 0$ and $J \to \infty$.



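$$
\frac{d}{dt} \omega = \frac{\eta}{J}\, \langle [|y| - \omega x\, \mathrm{sgn}(y)]\, \theta[-xy] \rangle - \frac{\omega \eta^2}{2J^2}\, \langle \theta[-xy] \rangle
$$

$$
= \frac{\eta}{J} \int\!\!\int\! dx\, dy\; |y|\, \theta[-xy]\, P(x, y) - \frac{\omega \eta}{J} \int\!\!\int\! dx\, dy\; x\, \mathrm{sgn}(y)\, \theta[-xy]\, P(x, y)
- \frac{\omega \eta^2}{2J^2} \int\!\!\int\! dx\, dy\; \theta[-xy]\, P(x, y)
$$

with $P(x, y)$ given by (14). As before the various Gaussian integrals occurring in these
expressions can be done analytically (see appendix), which results in

$$
\frac{d}{dt} J = -\frac{\eta (1 - \omega)}{\sqrt{2\pi}} + \frac{\eta^2}{2\pi J} \arccos(\omega)
$$

$$
\frac{d}{dt} \omega = \frac{\eta (1 - \omega^2)}{\sqrt{2\pi}\, J} - \frac{\omega \eta^2}{2\pi J^2} \arccos(\omega)
$$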

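Elimination of $\omega$ using (15) then gives us the dynamical equations in terms of the pair $(J, E)$:

$$
\frac{d}{dt} J = -\frac{\eta (1 - \cos(\pi E))}{\sqrt{2\pi}} + \frac{\eta^2 E}{2J}
\tag{24}
$$

$$
\frac{d}{dt} E = -\frac{\eta \sin(\pi E)}{\pi \sqrt{2\pi}\, J} + \frac{\eta^2 E}{2\pi J^2 \tan(\pi E)}
\tag{25}
$$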



Figure 6 shows the flow in the $(E, J)$ plane, obtained by numerical solution of (24,25). The
two lines where $\frac{d}{dt} J = 0$ and where $\frac{d}{dt} E = 0$ are found to be $J_{c,1}(E)$ and $J_{c,2}(E)$, respectively:

$$
J_{c,1}(E) = \eta \sqrt{\frac{\pi}{2}}\; \frac{E}{1 - \cos(\pi E)}
\qquad
J_{c,2}(E) = \eta \sqrt{\frac{\pi}{2}}\; \frac{E \cos(\pi E)}{1 - \cos^2(\pi E)}
$$
For $E \in [0, 1/2]$ one always has $J_{c,1}(E) \geq J_{c,2}(E)$, with equality only if $(J, E) = (\infty, 0)$.
Figure 6 shows that the flow is drawn into the gully between the curves $J_{c,1}(E)$ and $J_{c,2}(E)$.
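
The location of the two lines can be verified numerically by evaluating the right-hand sides of (24,25) on them. The sketch below assumes the forms $\frac{d}{dt}J = -\eta(1-\cos \pi E)/\sqrt{2\pi} + \eta^2 E/2J$ and $\frac{d}{dt}E = -\eta \sin(\pi E)/(\pi\sqrt{2\pi}J) + \eta^2 E/(2\pi J^2 \tan \pi E)$ for these equations; all function names are illustrative.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)

def dJ_dt(J, E, eta=1.0):
    """Right-hand side of equation (24)."""
    return -eta * (1.0 - math.cos(math.pi * E)) / SQRT_2PI + eta ** 2 * E / (2.0 * J)

def dE_dt(J, E, eta=1.0):
    """Right-hand side of equation (25)."""
    return (-eta * math.sin(math.pi * E) / (math.pi * SQRT_2PI * J)
            + eta ** 2 * E / (2.0 * math.pi * J ** 2 * math.tan(math.pi * E)))

def J_c1(E, eta=1.0):
    """Line in the (E, J) plane where dJ/dt = 0."""
    return eta * math.sqrt(math.pi / 2.0) * E / (1.0 - math.cos(math.pi * E))

def J_c2(E, eta=1.0):
    """Line in the (E, J) plane where dE/dt = 0."""
    c = math.cos(math.pi * E)
    return eta * math.sqrt(math.pi / 2.0) * E * c / (1.0 - c * c)

for E in (0.05, 0.1, 0.2, 0.3, 0.4):
    print(E, J_c1(E), J_c2(E), dJ_dt(J_c1(E), E), dE_dt(J_c2(E), E))
```

The last two columns should vanish up to rounding errors, and the second column should exceed the third for every E in the open interval (0, 1/2).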
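As with the Hebbian rule we now wish to investigate the asymptotic behaviour of the
generalization error. To do this we expand equations (24,25) for small E:

$$
\frac{d}{dt} J = -\frac{\eta \pi^2 E^2}{2 \sqrt{2\pi}} + \frac{\eta^2 E}{2J} + \mathcal{O}(E^4)
$$

$$
\frac{d}{dt} E = -\frac{\eta E}{\sqrt{2\pi}\, J} + \frac{\eta^2}{2\pi^2 J^2} - \frac{\eta^2 E^2}{6 J^2} + \mathcal{O}(E^3/J)
$$

For small E and large t we know that $J \sim J_{c,1}(E) \sim 1/E$. Making the ansatz $J = A/E$
(and hence $\frac{d}{dt} E = -\frac{E^2}{A} \frac{d}{dt} J$) leads to a situation where we have two equivalent differential
equations for E, to leading order:

$$
\frac{d}{dt} E = \Big[ \frac{\eta \pi^2}{2 \sqrt{2\pi}\, A} - \frac{\eta^2}{2A^2} \Big] E^4 + \ldots
\qquad
\frac{d}{dt} E = \Big[ \frac{\eta^2}{2\pi^2 A^2} - \frac{\eta}{\sqrt{2\pi}\, A} \Big] E^2 + \ldots
$$

Since both describe the same dynamics, the leading term of the second expression should be
identical to that of the first, i.e. $\mathcal{O}(E^4)$, giving us the condition $A = \eta \sqrt{2\pi}/2\pi^2$. Substitution of
this condition into the first expression for $\frac{d}{dt} E$ then gives

$$
\frac{d}{dt} E = -\frac{1}{2} \pi^3 E^4 + \ldots
$$

which has the solution

$$
E \sim \Big( \frac{2}{3} \Big)^{1/3} \pi^{-1}\, t^{-1/3} \qquad (t \to \infty)
\tag{26}
$$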



We find, somewhat surprisingly, that in large systems ($N \to \infty$) the on-line perceptron
learning rule is asymptotically much slower in converging towards the desired E = 0 state
than the simpler Hebbian rule. This will be different if we allow for time-dependent learning
rates. Figures 8, 9 and 10 will show the theoretical results on the perceptron rule together
with the results of numerical simulations and with similar results for other on-line
learning rules. Again the agreement between theory and experiment is quite satisfactory.
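The flow equations (24,25) and the two nullclines are simple enough to check numerically. The following is a minimal sketch, not the authors' code: the function names, the Runge-Kutta integrator and the step size are assumptions made for illustration only.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)


def perceptron_flow(E, J, eta=1.0):
    """Right-hand sides (dJ/dt, dE/dt) of equations (24) and (25)."""
    s, c = math.sin(math.pi * E), math.cos(math.pi * E)
    dJ = -eta * (1.0 - c) / SQRT_2PI + eta**2 * E / (2.0 * J)
    # E / tan(pi E) is written as E * cos / sin
    dE = (-eta * s / (math.pi * SQRT_2PI * J)
          + eta**2 * E * c / (2.0 * math.pi * J**2 * s))
    return dJ, dE


def J_c1(E, eta=1.0):
    """Nullcline dJ/dt = 0."""
    return eta * math.sqrt(math.pi / 2.0) * E / (1.0 - math.cos(math.pi * E))


def J_c2(E, eta=1.0):
    """Nullcline dE/dt = 0."""
    c = math.cos(math.pi * E)
    return eta * math.sqrt(math.pi / 2.0) * E * c / (1.0 - c * c)


def integrate(E, J, t_end, dt=0.02, eta=1.0):
    """Fourth-order Runge-Kutta integration of the (E, J) flow (illustrative choice)."""
    for _ in range(int(round(t_end / dt))):
        k1J, k1E = perceptron_flow(E, J, eta)
        k2J, k2E = perceptron_flow(E + 0.5 * dt * k1E, J + 0.5 * dt * k1J, eta)
        k3J, k3E = perceptron_flow(E + 0.5 * dt * k2E, J + 0.5 * dt * k2J, eta)
        k4J, k4E = perceptron_flow(E + dt * k3E, J + dt * k3J, eta)
        J += dt * (k1J + 2.0 * k2J + 2.0 * k3J + k4J) / 6.0
        E += dt * (k1E + 2.0 * k2E + 2.0 * k3E + k4E) / 6.0
    return E, J


if __name__ == "__main__":
    E, J = integrate(0.5, 1.0, 200.0)
    predicted = (2.0 / 3.0) ** (1.0 / 3.0) / math.pi * 200.0 ** (-1.0 / 3.0)
    print("E(200) =", E, " J(200) =", J, " power law (26):", predicted)
```

Comparing the printed E(200) with the power law (26) gives a rough impression of how quickly the asymptotic regime is reached; the agreement is only expected to become close for large t.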



2.4 AdaTron Learning with Constant Learning Rate 

As our third application we analyse the macroscopic dynamics of the AdaTron learning rule,
corresponding to the choice $\mathcal{F}[J;Jx,y] = |Jx|\,\theta[-xy]$ in the general recipe (6). As in the
perceptron rule, modifications are made only when student and teacher are in disagreement;
however, here the modification made is proportional to the magnitude of the student's local
field. Students are punished in proportion to their confidence in the wrong answer. The
rationale is that wrong student answers $S(\boldsymbol{\xi}) = \mathrm{sgn}(\mathbf{J}\cdot\boldsymbol{\xi})$ with large values of $|\mathbf{J}\cdot\boldsymbol{\xi}|$ require
more rigorous corrections to J to be remedied than those with small values of $|\mathbf{J}\cdot\boldsymbol{\xi}|$.

$$\mathbf{J}(t_\mu + \tfrac{1}{N}) = \mathbf{J}(t_\mu) + \frac{\eta}{N}\,\boldsymbol{\xi}^\mu\,\mathrm{sgn}(\mathbf{B}\cdot\boldsymbol{\xi}^\mu)\,|\mathbf{J}(t_\mu)\cdot\boldsymbol{\xi}^\mu|\;\theta[-(\mathbf{B}\cdot\boldsymbol{\xi}^\mu)(\mathbf{J}(t_\mu)\cdot\boldsymbol{\xi}^\mu)] \tag{27}$$

Figure 7: Flow in the (E, J) plane generated by the AdaTron learning rule with constant
learning rate $\eta = 1$, in the limit $N \to \infty$ (in this case the influence of the value of the learning
rate on the flow is more than just a rescaling of the length J).



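Working out the general equations (12,13) for the learning rule (27) gives

$$\frac{d}{dt}J = \eta J\int\!\!\int dx\,dy\; x|x|\,\mathrm{sgn}(y)\,\theta[-xy]P(x,y) + \tfrac{1}{2}\eta^2J\int\!\!\int dx\,dy\; x^2\,\theta[-xy]P(x,y)$$

$$\frac{d}{dt}\omega = \eta\int\!\!\int dx\,dy\;|xy|\,\theta[-xy]P(x,y) - \eta\omega\int\!\!\int dx\,dy\; x|x|\,\mathrm{sgn}(y)\,\theta[-xy]P(x,y) - \tfrac{1}{2}\omega\eta^2\int\!\!\int dx\,dy\; x^2\,\theta[-xy]P(x,y)$$

All integrals can again be done analytically (see appendix), so that we obtain explicit
macroscopic flow equations. Since $x|x|\,\mathrm{sgn}(y) = -x^2$ whenever $xy < 0$, only two distinct
integrals occur:

$$\frac{d}{dt}J = J\Big(\frac{\eta^2}{2} - \eta\Big)I_2(\omega) \qquad\qquad \frac{d}{dt}\omega = \eta\,I_1(\omega) + \omega\Big(\eta - \frac{\eta^2}{2}\Big)I_2(\omega)$$

with the short-hands

$$I_1(\omega) = \int\!\!\int dx\,dy\;|xy|\,\theta[-xy]P(x,y) = \frac{\sqrt{1-\omega^2}}{\pi} - \frac{\omega\arccos(\omega)}{\pi}$$

$$I_2(\omega) = \int\!\!\int dx\,dy\; x^2\,\theta[-xy]P(x,y) = \frac{\arccos(\omega)}{\pi} - \frac{\omega\sqrt{1-\omega^2}}{\pi}$$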

The usual translation from equations for the pair (J, $\omega$) into equations involving the pair (J, E),
following (15), turns out to simplify matters considerably, since it gives

$$\frac{d}{dt}J = J\Big(\frac{\eta^2}{2} - \eta\Big)\Big[E - \frac{\cos(\pi E)\sin(\pi E)}{\pi}\Big] \tag{28}$$

$$\frac{d}{dt}E = -\frac{\eta\sin^2(\pi E)}{\pi^2} + \frac{\eta^2E}{2\pi\tan(\pi E)} - \frac{\eta^2\cos^2(\pi E)}{2\pi^2} \tag{29}$$

The flow described by the equations (28,29) is shown in figure 7, for the case $\eta = 1$. In contrast
with the Hebbian and the perceptron learning rules we here observe from the equations (28,29)
that the learning rate $\eta$ cannot be eliminated from the macroscopic laws by a rescaling of
the weight vector length J. Moreover, the state E = 0 is stable only for $\eta < 3$, in which case
$\frac{d}{dt}E < 0$ for all t. For $\eta < 2$ one has $\frac{d}{dt}J < 0$ for all t, for $\eta = 2$ one has J(t) = J(0) for all t,
and for $2 < \eta < 3$ we have $\frac{d}{dt}J > 0$ for all t.
For small E equation (29) reduces to

$$\frac{d}{dt}E = \tfrac{1}{3}\eta(\eta - 3)E^2 + O(E^4)$$

giving

$$E \sim \frac{3}{\eta(3-\eta)\,t} \qquad (t\to\infty) \tag{30}$$

For $\eta = 1$, which gives the standard representation of the AdaTron algorithm, we find
$E \sim \frac{3}{2}t^{-1}$. Note from equation (28) that for the AdaTron rule there is a value for $\eta$ which
normalises the length J of the student's weight vector, $\eta = 2$, which again gives $E \sim \frac{3}{2}t^{-1}$.
The optimal value for $\eta$, however, is $\eta = \frac{3}{2}$, in which case we find $E \sim \frac{4}{3}t^{-1}$ (see (30)).
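The small-E reduction of (29) and the resulting prefactor in (30) can be checked numerically. The sketch below is illustrative only; the function names are hypothetical and the probe value E = 0.001 is an arbitrary choice.

```python
import math


def adatron_dE_dt(E, eta):
    """Right-hand side of equation (29) for the AdaTron rule."""
    s, c = math.sin(math.pi * E), math.cos(math.pi * E)
    return (-eta * s**2 / math.pi**2
            + eta**2 * E / (2.0 * math.pi * math.tan(math.pi * E))
            - eta**2 * c**2 / (2.0 * math.pi**2))


def decay_prefactor(eta):
    """Prefactor A in E ~ A / t according to (30); meaningful for 0 < eta < 3."""
    return 3.0 / (eta * (3.0 - eta))


if __name__ == "__main__":
    E = 1e-3
    for eta in (0.5, 1.0, 1.5, 2.0, 2.5):
        ratio = adatron_dE_dt(E, eta) / E**2       # should approach eta*(eta-3)/3
        print(eta, ratio, eta * (eta - 3.0) / 3.0, decay_prefactor(eta))
```

Scanning `decay_prefactor` over the interval (0, 3) reproduces the statement that the smallest prefactor is obtained at $\eta = 3/2$.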



2.5 Theory Versus Simulations 

We close this section with results of comparing the dynamics described by the various macro- 
scopic flow equations with the results of measuring the error E during numerical simulations 
of the various (microscopic) learning rules discussed so far. This will serve to support the 
analysis and its implicit and explicit assumptions, but also illustrates how the three learning 
rules compare among one another. Figures 8 and 9 show the initial stage of the learning
processes, for initialisations corresponding to random guessing (E = 0.5) and almost correct
classification (E small), respectively (note that for the perceptron and AdaTron rules starting
at precisely E = 0 produces in finite systems a stationary state). Here the solutions of the
flow equations (solid lines) were obtained by numerical iteration. The initial increase in the
error E, as observed for the Hebbian and perceptron rule, following initialisation with small
values of E can be understood as follows. The error depends only on the angle of the weight
vector J, not on its length J. This means that the modifications generated by the Hebbian
and perceptron learning rules (which are of uniform magnitude) generate large changes in
E when J is small, but small changes in E when J is large, with corresponding effects on
the stability of low E states. The AdaTron rule, in contrast, involves weight changes which
scale with the length J, so that the stability of the E = 0 state does not depend on the value
of J. Figure 10 shows the asymptotic relaxation of the error E, in a log-log plot, together
with the three corresponding asymptotic (power law) predictions (22,26,30). All simulations
were carried out with networks of N = 1000 neurons, which apparently is already sufficiently
large for the $N = \infty$ theory to apply. The teacher weight vectors B were in all cases drawn
at random from $[-1,1]^N$. We conclude that the theory describes the simulations essentially
perfectly.

Figure 8: Evolution in time of the generalization error E as measured during numerical
simulations (with N = 1000 neurons) of three different learning rules: Hebbian (diamonds),
perceptron (triangles) and AdaTron (squares). Initial state: E(0) = 1/2 (random guessing)
and J(0) = 1. Learning rate: $\eta = 1$. The solid lines give for each learning rule the prediction
of the $N = \infty$ theory, obtained by numerical solution of the flow equations for (E, J).
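A microscopic simulation of the kind described above can be sketched in a few lines. This is a minimal illustration and not the code used for the figures. It assumes that the questions $\boldsymbol{\xi}$ are drawn uniformly from $\{-1,1\}^N$, that the teacher vector is normalised to unit length after being drawn from $[-1,1]^N$, and it uses a small system (N = 100) purely to keep the run time short; the function names are hypothetical.

```python
import math
import random


def generalisation_error(J, B):
    """E = arccos(J.B / (|J||B|)) / pi."""
    dot = sum(j * b for j, b in zip(J, B))
    norm_J = math.sqrt(sum(j * j for j in J))
    norm_B = math.sqrt(sum(b * b for b in B))
    return math.acos(max(-1.0, min(1.0, dot / (norm_J * norm_B)))) / math.pi


def simulate(rule, N=100, t_end=10.0, eta=1.0, seed=0):
    """On-line learning for t_end * N iteration steps; returns the final error E."""
    rng = random.Random(seed)
    B = [rng.uniform(-1.0, 1.0) for _ in range(N)]
    norm = math.sqrt(sum(b * b for b in B))
    B = [b / norm for b in B]                      # assumption: |B| = 1
    J = [rng.gauss(0.0, 1.0) for _ in range(N)]
    norm = math.sqrt(sum(j * j for j in J))
    J = [j / norm for j in J]                      # random direction, J(0) = 1
    for _ in range(int(t_end * N)):
        xi = [rng.choice((-1.0, 1.0)) for _ in range(N)]
        x = sum(j * s for j, s in zip(J, xi))      # student field J.xi
        y = sum(b * s for b, s in zip(B, xi))      # teacher field B.xi
        if rule == "hebbian":
            F = 1.0
        elif rule == "perceptron":
            F = 1.0 if x * y < 0 else 0.0
        elif rule == "adatron":
            F = abs(x) if x * y < 0 else 0.0
        else:
            raise ValueError(rule)
        if F:
            step = eta * F * (1.0 if y > 0 else -1.0) / N
            J = [j + step * s for j, s in zip(J, xi)]
    return generalisation_error(J, B)


if __name__ == "__main__":
    for rule in ("hebbian", "perceptron", "adatron"):
        print(rule, simulate(rule))
```

For such a small N the measured error fluctuates visibly from run to run, so quantitative comparison with the $N = \infty$ flow equations requires larger systems or averaging over seeds.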

3 On-Line Learning: Complete Training Sets and Optimised 
Rules 

We now set out to use our macroscopic equations in 'reverse mode'. Rather than calculate 
the macroscopic dynamics for a given choice of learning rule, we will try to find learning rules 
that optimise the macroscopic dynamical laws in the sense that they produce the fastest
decay towards the desired E = 0 state. As a bonus it will turn out that in many cases we can
even solve the corresponding macroscopic differential equations analytically, and find explicit
expressions for E(t), or rather its inverse t(E).

3.1 Time-Dependent Learning Rates 

First we illustrate how modifying existing learning rules in a simple way, by just allowing for
suitably chosen time-dependent learning rates $\eta(t)$, can already lead to a drastic improvement
in the asymptotic behaviour of the error E.

We will inspect two specific choices of time-dependent learning rates for the perceptron
rule. Without loss of generality we can always put $\eta(t) = K(t)J(t)$ in our dynamic equations
(for notational convenience we will drop the explicit time argument of K). This choice will
enable us to decouple the dynamics of J from that of the generalization error E. For the
perceptron rule we subsequently find equation (25) being replaced by

$$\frac{d}{dt}E = -\frac{K\sin(\pi E)}{\pi\sqrt{2\pi}} + \frac{K^2E}{2\pi\tan(\pi E)}$$

Figure 9: Evolution in time of the generalization error E as measured during numerical
simulations (with N = 1000 neurons) of three different learning rules: Hebbian (diamonds),
perceptron (triangles) and AdaTron (squares). Initial state: $E(0) \approx 0.025$ and J(0) = 1.
Learning rate: $\eta = 1$. The solid lines give for each learning rule the prediction of the $N = \infty$
theory, obtained by numerical solution of the flow equations for (E, J).



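giving for small E

$$\frac{d}{dt}E = -\frac{KE}{\sqrt{2\pi}} + \frac{K^2}{2\pi^2} + O(KE^3, K^2E^2)$$

In order to obtain $E \to 0$ for $t \to \infty$ it is clear that we need $K \to 0$. Applying the ansatz
$E = A/t^{\alpha}$, $K = B/t^{\beta}$ for the asymptotic forms in the previous equation produces

$$-\alpha A\,t^{-\alpha-1} = -\frac{AB}{\sqrt{2\pi}}\,t^{-\alpha-\beta} + \frac{B^2}{2\pi^2}\,t^{-2\beta} + O(t^{-2\alpha-2\beta})$$

and so: $\alpha = \beta = 1$ and $A = B^2/[\pi\sqrt{2\pi}\,(B - \sqrt{2\pi})]$. Our aim is to obtain the fastest approach of
the E = 0 state, i.e. we wish to maximise $\alpha$ (for which we found $\alpha = 1$) and subsequently
minimise A. Apparently the value of B for which A is minimized is $B = 2\sqrt{2\pi}$, in which case
we obtain the error decay given by

$$\eta \sim \frac{2J\sqrt{2\pi}}{t} \qquad\qquad E \sim \frac{4}{\pi t} \qquad (t\to\infty) \tag{31}$$

This is clearly a great improvement upon the result for the perceptron rule with constant $\eta$,
i.e. equation (26); in fact it is the fastest relaxation we have derived so far.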
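The minimisation over B that leads to (31) is elementary, but a quick numerical check is a useful safeguard against algebra slips. The sketch below uses a plain golden-section search; the function names and the search interval are assumptions made for illustration.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)


def prefactor_A(B):
    """A(B) = B^2 / [pi sqrt(2 pi) (B - sqrt(2 pi))], defined for B > sqrt(2 pi)."""
    return B * B / (math.pi * SQRT_2PI * (B - SQRT_2PI))


def golden_min(f, lo, hi, iters=100):
    """Golden-section search for the minimum of a unimodal function on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c1 = b - g * (b - a)
        c2 = a + g * (b - a)
        if f(c1) < f(c2):
            b = c2
        else:
            a = c1
    return 0.5 * (a + b)


if __name__ == "__main__":
    B_star = golden_min(prefactor_A, SQRT_2PI + 0.01, 20.0)
    print("B* =", B_star, " expected", 2.0 * SQRT_2PI)
    print("A(B*) =", prefactor_A(B_star), " expected", 4.0 / math.pi)
```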



Figure 10: Asymptotic behaviour of the generalization error E measured during numerical
simulations (with N = 1000) of three different learning rules: Hebbian (diamonds, middle
curve), perceptron (triangles, upper curve) and AdaTron (squares, lower curve). Initial state:
E(0) = 1/2 and J(0) = 1. Learning rate: $\eta = 1$. The dashed lines give for each learning rule the
corresponding power law predicted by the $N = \infty$ theory (equations (22,26,30), respectively).

Let us now move to an alternative choice for the time-dependent learning rate for the
perceptron. According to equation (24) there is one specific recipe for $\eta(t)$ such that the
length J of the student's weight vector will remain constant, given by

$$\eta = J\sqrt{\frac{2}{\pi}}\;\frac{1-\cos(\pi E)}{E} \tag{32}$$

Making this choice converts equation (25) for the evolution of E into

$$\frac{d}{dt}E = -\frac{(1-\cos(\pi E))^2}{\pi^2E\sin(\pi E)} \tag{33}$$

Equation (33) can be written in the form $\frac{d}{dE}t = g(E)$, so that t(E) becomes a simple integral
which can be done analytically, with the result

$$t(E) = \frac{\pi E + \sin(\pi E)}{1-\cos(\pi E)} - \frac{\pi E_0 + \sin(\pi E_0)}{1-\cos(\pi E_0)} \tag{34}$$

(which can also be verified directly by substitution into (33)). Expansion of (34) and (32) for
small E gives the asymptotic behaviour also encountered in (31):

$$\eta \sim \frac{2J\sqrt{2\pi}}{t} \qquad\qquad E \sim \frac{4}{\pi t} \qquad (t\to\infty) \tag{35}$$



It might appear that implementation of the recipe (32) is in practice impossible, since it
involves information which is not available to the student perceptron (namely the instantaneous
error E). However, since we know (34) we can simply calculate the required $\eta(t)$ explicitly
as a function of time.
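One way to make the last remark concrete is to invert (34) numerically and insert the result into (32). The sketch below does this by bisection; it is an illustration under stated assumptions (initial error $E_0 = 1/2$ by default, hypothetical function names), not a prescription from the original analysis.

```python
import math


def t_of_E(E, E0=0.5):
    """Learning time t(E) of equation (34)."""
    def g(e):
        # 1 - cos(pi e) is written as 2 sin^2(pi e / 2) to avoid cancellation
        return (math.pi * e + math.sin(math.pi * e)) / (2.0 * math.sin(0.5 * math.pi * e) ** 2)
    return g(E) - g(E0)


def E_of_t(t, E0=0.5):
    """Invert (34) by bisection; t_of_E decreases monotonically with E on (0, E0]."""
    lo, hi = 1e-12, E0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if t_of_E(mid, E0) > t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def eta_of_t(t, J=1.0, E0=0.5):
    """Learning rate (32) evaluated along the solution of (33)."""
    E = E_of_t(t, E0)
    return J * math.sqrt(2.0 / math.pi) * 2.0 * math.sin(0.5 * math.pi * E) ** 2 / E


if __name__ == "__main__":
    for t in (1.0, 10.0, 100.0, 1e4):
        print(t, E_of_t(t), 4.0 / (math.pi * t), eta_of_t(t), 2.0 * math.sqrt(2.0 * math.pi) / t)
```

The printed columns allow a direct comparison with the asymptotic forms (35).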



One has to be somewhat careful in extrapolating results such as those obtained in this
section. For instance, choosing the time-dependent learning rate (32) enforces the constraint
$J^2(t) = 1$ in the macroscopic equations for $N \to \infty$. This is not identical to choosing $\eta(t_\mu)$
in the original equation (6) such as to enforce $\mathbf{J}^2(t_\mu + \frac{1}{N}) = \mathbf{J}^2(t_\mu)$ at the level of individual
iteration steps, as can be seen by working out the dynamical laws. The latter case would
correspond to the microscopically fluctuating choice

$$\eta(t_\mu) = -\frac{2\,(\mathbf{J}(t_\mu)\cdot\boldsymbol{\xi}^\mu)\,\mathrm{sgn}(\mathbf{B}\cdot\boldsymbol{\xi}^\mu)}{\mathcal{F}[|\mathbf{J}(t_\mu)|;\mathbf{J}(t_\mu)\cdot\boldsymbol{\xi}^\mu,\mathbf{B}\cdot\boldsymbol{\xi}^\mu]}$$

If we now choose for example $\mathcal{F}[J;Jx,y] = \theta[-xy]$, implying $\eta(t_\mu) = 2|\mathbf{J}(t_\mu)\cdot\boldsymbol{\xi}^\mu|$, we find by
insertion into (6) that the perceptron rule with 'hard' weight normalisation at each iteration
step via adaptation of the learning rate is identical to the AdaTron rule with constant learning
rate $\eta = 2$. We know therefore that in this case one obtains $E \sim 3/2t$, whereas for the
perceptron rule with 'soft' weight normalisation via (32) (see the analysis above) one obtains
$E \sim 4/\pi t$. Apparently the two procedures are not equivalent.

3.2 Spherical On-Line Learning Rules 

We arrive in a natural way at the question of how to find the optimal time-dependent learning 
rate for any given learning rule, or more generally: of how to find the optimal learning 
rule. This involves variational calculations in two-dimensional flows (since our macroscopic 
equations are defined in terms of the evolving pair (J, E)). Such calculations would be much 
simpler if our macroscopic equations were just one-dimensional, e.g. describing only the 
evolution of the error E with a stationary (or simply irrelevant) value of the length J. Often 
it will turn out that for finding the optimal learning rate or the optimal learning rule the 
problem can indeed be reduced to a one-dimensional one. To be able to obtain results also for 
those cases where this reduction does not happen we will now construct so-called spherical 
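learning rules, where $J^2(t) = 1$ for all t. This can be arranged in several equivalent ways.

The first method is to add to the general rule (6) a term proportional to the instantaneous
weight vector J, whose sole purpose is to achieve the constraint $J^2 = 1$:

$$\mathbf{J}(t_\mu + \tfrac{1}{N}) = \mathbf{J}(t_\mu) + \frac{1}{N}\Big\{\eta(t_\mu)\,\boldsymbol{\xi}^\mu\,\mathrm{sgn}(\mathbf{B}\cdot\boldsymbol{\xi}^\mu)\,\mathcal{F}[|\mathbf{J}(t_\mu)|;\mathbf{J}(t_\mu)\cdot\boldsymbol{\xi}^\mu,\mathbf{B}\cdot\boldsymbol{\xi}^\mu] - \lambda(t_\mu)\,\mathbf{J}(t_\mu)\Big\} \tag{36}$$

The evolution of the two observables Q[J] and R[J] (10) is now given by

$$Q[\mathbf{J}(t_\mu + \tfrac{1}{N})] = Q[\mathbf{J}(t_\mu)]\Big(1 - \frac{2\lambda(t_\mu)}{N}\Big) + \frac{2}{N}\eta(t_\mu)\,(\mathbf{J}(t_\mu)\cdot\boldsymbol{\xi}^\mu)\,\mathrm{sgn}(\mathbf{B}\cdot\boldsymbol{\xi}^\mu)\,\mathcal{F}[|\mathbf{J}(t_\mu)|;\mathbf{J}(t_\mu)\cdot\boldsymbol{\xi}^\mu,\mathbf{B}\cdot\boldsymbol{\xi}^\mu]$$

$$\qquad + \frac{1}{N}\eta^2(t_\mu)\,\mathcal{F}^2[|\mathbf{J}(t_\mu)|;\mathbf{J}(t_\mu)\cdot\boldsymbol{\xi}^\mu,\mathbf{B}\cdot\boldsymbol{\xi}^\mu] + O(N^{-2})$$

$$R[\mathbf{J}(t_\mu + \tfrac{1}{N})] = R[\mathbf{J}(t_\mu)]\Big(1 - \frac{\lambda(t_\mu)}{N}\Big) + \frac{1}{N}\eta(t_\mu)\,|\mathbf{B}\cdot\boldsymbol{\xi}^\mu|\,\mathcal{F}[|\mathbf{J}(t_\mu)|;\mathbf{J}(t_\mu)\cdot\boldsymbol{\xi}^\mu,\mathbf{B}\cdot\boldsymbol{\xi}^\mu]$$

Following the procedure of section 1.2 to arrive at the $N \to \infty$ limit of the dynamical
equations for Q and R then leads to (we drop explicit time arguments for notational convenience):

$$\frac{d}{dt}Q = 2\eta Q^{\frac{1}{2}}\big\langle x\,\mathrm{sgn}(y)\,\mathcal{F}[Q^{\frac{1}{2}};Q^{\frac{1}{2}}x,y]\big\rangle + \eta^2\big\langle\mathcal{F}^2[Q^{\frac{1}{2}};Q^{\frac{1}{2}}x,y]\big\rangle - 2\lambda Q$$

$$\frac{d}{dt}R = \eta\big\langle|y|\,\mathcal{F}[Q^{\frac{1}{2}};Q^{\frac{1}{2}}x,y]\big\rangle - \lambda R$$

We now choose the function $\lambda(t)$ such that Q(t) = 1 for all $t \geq 0$. This ensures that
$R(t) = \omega(t) = \mathbf{J}(t)\cdot\mathbf{B}$, and gives (via $\frac{d}{dt}Q = 0$) a recipe for $\lambda(t)$

$$\lambda = \eta\big\langle x\,\mathrm{sgn}(y)\,\mathcal{F}[1;x,y]\big\rangle + \tfrac{1}{2}\eta^2\big\langle\mathcal{F}^2[1;x,y]\big\rangle$$

which can then be substituted into our equation for $\frac{d}{dt}\omega$:

$$\frac{d}{dt}\omega = \eta\big\langle[\,|y| - \omega x\,\mathrm{sgn}(y)]\,\mathcal{F}[1;x,y]\big\rangle - \tfrac{1}{2}\omega\eta^2\big\langle\mathcal{F}^2[1;x,y]\big\rangle \tag{37}$$

with averages as usual defined with respect to the Gaussian joint field distribution (14), which
depends only on $\omega$, so that equation (37) is indeed autonomous.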

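The second method to arrange the constraint $J^2 = 1$ is to explicitly normalise the weight
vector J after each modification step, i.e.

$$\mathbf{J}(t_\mu + \tfrac{1}{N}) = \frac{\mathbf{J}(t_\mu) + \frac{1}{N}\eta(t_\mu)\,\boldsymbol{\xi}^\mu\,\mathrm{sgn}(\mathbf{B}\cdot\boldsymbol{\xi}^\mu)\,\mathcal{F}[1;\mathbf{J}(t_\mu)\cdot\boldsymbol{\xi}^\mu,\mathbf{B}\cdot\boldsymbol{\xi}^\mu]}{\big|\mathbf{J}(t_\mu) + \frac{1}{N}\eta(t_\mu)\,\boldsymbol{\xi}^\mu\,\mathrm{sgn}(\mathbf{B}\cdot\boldsymbol{\xi}^\mu)\,\mathcal{F}[1;\mathbf{J}(t_\mu)\cdot\boldsymbol{\xi}^\mu,\mathbf{B}\cdot\boldsymbol{\xi}^\mu]\big|} \tag{38}$$

$$= \mathbf{J}(t_\mu) + \frac{1}{N}\eta(t_\mu)\Big\{[\boldsymbol{\xi}^\mu - \mathbf{J}(t_\mu)(\mathbf{J}(t_\mu)\cdot\boldsymbol{\xi}^\mu)]\,\mathrm{sgn}(\mathbf{B}\cdot\boldsymbol{\xi}^\mu)\,\mathcal{F}[1;\mathbf{J}(t_\mu)\cdot\boldsymbol{\xi}^\mu,\mathbf{B}\cdot\boldsymbol{\xi}^\mu] - \tfrac{1}{2}\eta(t_\mu)\,\mathbf{J}(t_\mu)\,\mathcal{F}^2[1;\mathbf{J}(t_\mu)\cdot\boldsymbol{\xi}^\mu,\mathbf{B}\cdot\boldsymbol{\xi}^\mu]\Big\} + O(N^{-2})$$

The evolution of the observable $\omega[\mathbf{J}] = \mathbf{J}\cdot\mathbf{B}$ is thus given by

$$\omega[\mathbf{J}(t_\mu + \tfrac{1}{N})] = \omega[\mathbf{J}(t_\mu)] + \frac{1}{N}\eta(t_\mu)\Big\{[\,|\mathbf{B}\cdot\boldsymbol{\xi}^\mu| - \omega[\mathbf{J}(t_\mu)]\,(\mathbf{J}(t_\mu)\cdot\boldsymbol{\xi}^\mu)\,\mathrm{sgn}(\mathbf{B}\cdot\boldsymbol{\xi}^\mu)]\,\mathcal{F}[1;\mathbf{J}(t_\mu)\cdot\boldsymbol{\xi}^\mu,\mathbf{B}\cdot\boldsymbol{\xi}^\mu]$$

$$\qquad - \tfrac{1}{2}\omega[\mathbf{J}(t_\mu)]\,\eta(t_\mu)\,\mathcal{F}^2[1;\mathbf{J}(t_\mu)\cdot\boldsymbol{\xi}^\mu,\mathbf{B}\cdot\boldsymbol{\xi}^\mu]\Big\} + O(N^{-2})$$

Following the procedure of section 1.2 then leads to

$$\frac{d}{dt}\omega = \eta\big\langle[\,|y| - \omega x\,\mathrm{sgn}(y)]\,\mathcal{F}[1;x,y]\big\rangle - \tfrac{1}{2}\omega\eta^2\big\langle\mathcal{F}^2[1;x,y]\big\rangle \tag{39}$$

which is identical to equation (37).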
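The claim that explicit normalisation reproduces the same dynamics for the overlap $\omega$ up to $O(N^{-2})$ per step can be probed for a single iteration. The sketch below is an illustration only. It assumes the Hebbian choice $\mathcal{F} = 1$, questions drawn from $\{-1,1\}^N$ and randomly drawn unit vectors J and B; the function names and the seed are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [p / norm for p in v]


def normalisation_mismatch(N=400, eta=1.0, seed=1):
    """Difference in the overlap J.B between one exactly normalised step
    and its expansion to first order in 1/N; returns (mismatch, N)."""
    rng = random.Random(seed)
    J = unit([rng.gauss(0.0, 1.0) for _ in range(N)])
    B = unit([rng.gauss(0.0, 1.0) for _ in range(N)])
    xi = [rng.choice((-1.0, 1.0)) for _ in range(N)]
    x, y = dot(J, xi), dot(B, xi)
    sgn = 1.0 if y >= 0 else -1.0
    F = 1.0  # assumption: Hebbian rule, so every step modifies J
    # exact update: add the increment, then rescale to unit length
    exact = unit([j + eta / N * s * sgn * F for j, s in zip(J, xi)])
    # expansion of the normalised update to first order in 1/N
    approx = [j + eta / N * ((s - j * x) * sgn * F - 0.5 * eta * j * F * F)
              for j, s in zip(J, xi)]
    return abs(dot(exact, B) - dot(approx, B)), N


if __name__ == "__main__":
    for N in (100, 400, 1600):
        err, _ = normalisation_mismatch(N=N)
        print(N, err, err * N * N)
```

If the expansion is correct, the last printed column (the mismatch multiplied by $N^2$) should remain of order one as N grows.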

Finally we convert equation (37) into a dynamical equation for the error E, using (15),
which gives the final result

$$\frac{d}{dt}E = -\frac{\eta}{\pi\sin(\pi E)}\big\langle[\,|y| - \cos(\pi E)\,x\,\mathrm{sgn}(y)]\,\mathcal{F}[1;x,y]\big\rangle + \frac{\eta^2}{2\pi\tan(\pi E)}\big\langle\mathcal{F}^2[1;x,y]\big\rangle \tag{40}$$

with averages defined with respect to the distribution (14), in which $\omega = \cos(\pi E)$.

For spherical models described by either of the equivalent classes of on-line rules (36)
or (38) the evolution of the error is described by a single first-order non-linear differential
equation, rather than a pair of coupled non-linear differential equations. This will allow us
to push the analysis further, but the price we pay is that of a loss in generality.

3.3 Optimal Time-Dependent Learning Rates

We wish to optimise the approach to the E = 0 state of our macroscopic equations, by
choosing a suitable time-dependent learning rate. Let us distinguish between the possible
situations we can find ourselves in. If our learning rule is of the general form (6), without
spherical normalisation, we have two coupled macroscopic equations:

$$\frac{d}{dt}J = \eta\big\langle x\,\mathrm{sgn}(y)\,\mathcal{F}[J;Jx,y]\big\rangle + \frac{\eta^2}{2J}\big\langle\mathcal{F}^2[J;Jx,y]\big\rangle \tag{41}$$

$$\frac{d}{dt}E = -\frac{\eta}{J\pi\sin(\pi E)}\big\langle[\,|y| - \cos(\pi E)\,x\,\mathrm{sgn}(y)]\,\mathcal{F}[J;Jx,y]\big\rangle + \frac{\eta^2}{2\pi J^2\tan(\pi E)}\big\langle\mathcal{F}^2[J;Jx,y]\big\rangle \tag{42}$$

which are obtained by combining (12,13) with (15). The probability distribution (14) with
which the averages are computed depends on E only, not on J. If, on the other hand, we
complement the rule (6) with weight vector normalisation as in (36) or (38) (the spherical
rules), we obtain a single equation for E only:

$$\frac{d}{dt}E = -\frac{\eta}{\pi\sin(\pi E)}\big\langle[\,|y| - \cos(\pi E)\,x\,\mathrm{sgn}(y)]\,\mathcal{F}[1;x,y]\big\rangle + \frac{\eta^2}{2\pi\tan(\pi E)}\big\langle\mathcal{F}^2[1;x,y]\big\rangle \tag{43}$$



Since equation (43) is autonomous (there are no dynamical variables other than E), the
optimal choice of the function $\eta(t)$ (i.e. the one that generates the fastest decay of the error
E) is obtained by simply minimising the temporal derivative of the error at each time-step:

$$\forall t \geq 0: \qquad \frac{\partial}{\partial\eta(t)}\Big[\frac{d}{dt}E\Big] = 0 \tag{44}$$

which is called the 'greedy' recipe. Note, however, that the same is true for equation (42) if we
restrict ourselves to rules with the property that $\mathcal{F}[J;Jx,y] = \gamma(J)\,\mathcal{F}[1;x,y]$ for some function
$\gamma(J)$, such as the Hebbian ($\gamma(J) = 1$), perceptron ($\gamma(J) = 1$) and AdaTron ($\gamma(J) = J$) rules.
This property can also be written as

$$\frac{\partial}{\partial x}\,\frac{\mathcal{F}[J;Jx,y]}{\mathcal{F}[1;x,y]} = \frac{\partial}{\partial y}\,\frac{\mathcal{F}[J;Jx,y]}{\mathcal{F}[1;x,y]} = 0 \tag{45}$$

For rules which obey (45) we can simply write the time-dependent learning rate as
$\eta = \tilde{\eta}J/\gamma(J)$, such that equations (41,42) acquire the form:



$$\frac{d}{dt}\log J = \tilde{\eta}\big\langle x\,\mathrm{sgn}(y)\,\mathcal{F}[1;x,y]\big\rangle + \tfrac{1}{2}\tilde{\eta}^2\big\langle\mathcal{F}^2[1;x,y]\big\rangle \tag{46}$$

$$\frac{d}{dt}E = -\frac{\tilde{\eta}}{\pi\sin(\pi E)}\big\langle[\,|y| - \cos(\pi E)\,x\,\mathrm{sgn}(y)]\,\mathcal{F}[1;x,y]\big\rangle + \frac{\tilde{\eta}^2}{2\pi\tan(\pi E)}\big\langle\mathcal{F}^2[1;x,y]\big\rangle \tag{47}$$

In these cases, precisely since we are free to choose the function $\tilde{\eta}(t)$ as we wish, the evolution
of J decouples from our problem of optimising the evolution of E. For learning rules where
$\mathcal{F}[J;Jx,y]$ truly depends on J, on the other hand (i.e. where (45) does not hold), optimisation
of the error relaxation is considerably more difficult, and is likely to depend on the particular
time t for which one wants to minimise E(t). We will not deal with such cases here.

If the 'greedy' recipe applies (for spherical rules and for ordinary ones with the property
(45)) working out the derivative in (44) immediately gives us

$$\tilde{\eta}_{\rm opt} = \frac{\big\langle[\,|y| - \cos(\pi E)\,x\,\mathrm{sgn}(y)]\,\mathcal{F}[1;x,y]\big\rangle}{\cos(\pi E)\,\big\langle\mathcal{F}^2[1;x,y]\big\rangle} \tag{48}$$

Insertion of this choice into equation (40) subsequently leads to

$$\frac{dE}{dt}\Big|_{\rm opt} = -\frac{\big\langle[\,|y| - \cos(\pi E)\,x\,\mathrm{sgn}(y)]\,\mathcal{F}[1;x,y]\big\rangle^2}{2\pi\sin(\pi E)\cos(\pi E)\,\big\langle\mathcal{F}^2[1;x,y]\big\rangle} \tag{49}$$

These and subsequent expressions we will write in terms of $\tilde{\eta}$, defined as $\tilde{\eta}(t) = \eta(t)$ for the
spherical learning rules and, consistent with the substitution $\eta = \tilde{\eta}J/\gamma(J)$ made above, as
$\tilde{\eta}(t) = \eta(t)\gamma(J(t))/J(t)$ for the non-spherical learning rules.
We will now work out the details of the results (48,49) upon making the familiar choices for
the function $\mathcal{F}[\ldots]$: the Hebbian, perceptron and AdaTron rules.

For the (ordinary and spherical) Hebbian rules, corresponding to $\mathcal{F}[J;Jx,y] = 1$, the
various Gaussian integrals in (48,49) are the same as those we already did (analytically)
in the case of constant learning rate $\eta$. Substitution of the outcomes of the integrals (see
appendix) into the equations (48,49) gives

$$\tilde{\eta}_{\rm opt} = \sqrt{\frac{2}{\pi}}\;\frac{\sin^2(\pi E)}{\cos(\pi E)} \qquad\qquad \frac{dE}{dt}\Big|_{\rm opt} = -\frac{\sin^3(\pi E)}{\pi^2\cos(\pi E)}$$

The equation for the error E can be solved explicitly, giving (to be verified by substitution):

$$t(E) = \tfrac{1}{2}\pi\sin^{-2}(\pi E) - \tfrac{1}{2}\pi\sin^{-2}(\pi E_0) \tag{50}$$

The asymptotic behaviour of the process follows from expansion of (50) for small E, and
gives

$$E_{\rm opt} \sim \frac{1}{\sqrt{2\pi t}} \qquad\qquad \tilde{\eta}_{\rm opt} \sim \frac{\sqrt{2\pi}}{2t} \qquad (t\to\infty)$$

Asymptotically there is nothing to be gained by choosing the optimal time-dependent learning
rate, since the same asymptotic form for E was also obtained for constant $\eta$ (see (22)). Note
that the property $\mathcal{F}[J;Jx,y] = \mathcal{F}[1;x,y]$ of the Hebbian recipe guarantees that the result (50)
applies to both the ordinary and the spherical Hebbian rule. The only difference between the
two cases is in the definition of $\tilde{\eta}$: for the ordinary (non-spherical) version $\tilde{\eta}(t) = \eta(t)/J(t)$,
whereas for the spherical version $\tilde{\eta}(t) = \eta(t)$.

We move on to the (ordinary and spherical) perceptron learning rules, where
$\mathcal{F}[J;Jx,y] = \theta[-xy]$, with time-dependent learning rates $\eta(t)$ which we aim to optimise. As in the Hebbian
case all integrals occurring in (48,49) upon substitution of the present choice
$\mathcal{F}[J;Jx,y] = \theta[-xy]$ have been done already (see the appendix). Insertion of the outcomes of these
integrals into (48,49) gives

$$\tilde{\eta}_{\rm opt} = \frac{\sin^2(\pi E)}{\sqrt{2\pi}\,E\cos(\pi E)} \qquad\qquad \frac{dE}{dt}\Big|_{\rm opt} = -\frac{\sin^3(\pi E)}{4\pi^2E\cos(\pi E)}$$

Again the non-linear differential equation describing the evolution of the error E can be
solved exactly:

$$t(E) = \frac{2[\pi E + \sin(\pi E)\cos(\pi E)]}{\sin^2(\pi E)} - \frac{2[\pi E_0 + \sin(\pi E_0)\cos(\pi E_0)]}{\sin^2(\pi E_0)} \tag{51}$$

Expansion of (51) for small E gives the asymptotic behaviour

$$E_{\rm opt} \sim \frac{4}{\pi t} \qquad\qquad \tilde{\eta}_{\rm opt} \sim \frac{2\sqrt{2\pi}}{t} \qquad (t\to\infty)$$

which is identical to that found in the beginning of this section, i.e. equations (31,35), upon
exploring the consequences of making two simple ad-hoc choices for the time-dependent
learning rate (since $\tilde{\eta} = \eta/J$). As with the Hebbian rule the property $\mathcal{F}[J;Jx,y] = \mathcal{F}[1;x,y]$
of the perceptron recipe guarantees that the result (51) applies to both the ordinary and the
spherical version.

Finally we try to optimise the learning rate for the spherical AdaTron learning rule,
corresponding to the choice $\mathcal{F}[J;Jx,y] = |Jx|\,\theta[-xy]$. Working out the averages in (48,49)
again does not require doing any new integrals. Using those already encountered in analysing
the AdaTron rule with constant learning rate (to be found in the appendix), we obtain

$$\tilde{\eta}_{\rm opt} = \frac{\sin^3(\pi E)}{\pi}\;\frac{1}{E\cos(\pi E) - \frac{1}{\pi}\cos^2(\pi E)\sin(\pi E)}$$

$$\frac{dE}{dt}\Big|_{\rm opt} = -\frac{\sin^5(\pi E)}{2\pi^2\cos(\pi E)}\;\frac{1}{\pi E - \cos(\pi E)\sin(\pi E)}$$

(note that in both versions, ordinary and spherical, of the AdaTron rule we simply have
$\tilde{\eta}(t) = \eta(t)$). It will no longer come as a surprise that also this equation for the evolution of
the error allows for analytical solution:

$$t(E) = \frac{\pi}{8}\Big[\frac{4\pi E - \sin(4\pi E)}{\sin^4(\pi E)} - \frac{4\pi E_0 - \sin(4\pi E_0)}{\sin^4(\pi E_0)}\Big] \tag{52}$$

Asymptotically we find, upon expanding (52) for small E, a relaxation of the form

$$E_{\rm opt} \sim \frac{4}{3t} \qquad\qquad \tilde{\eta}_{\rm opt} \sim \frac{3}{2} \qquad (t\to\infty)$$

So for the AdaTron rule the asymptotic behaviour for optimal time-dependent learning rate
$\eta$ is identical to that found for optimal constant learning rate $\eta$ (which is indeed $\eta = \frac{3}{2}$, see
(30)). As with the previous two rules, the property $\mathcal{F}[J;Jx,y] = J\mathcal{F}[1;x,y]$ of the AdaTron
recipe guarantees that the result (52) applies to both the ordinary and the spherical version.
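The three closed-form results t(E) in (50), (51) and (52) are stated as verifiable by substitution. A numerical version of that check is sketched below: the derivative dt/dE is estimated by a central difference and multiplied by the corresponding optimal rate dE/dt, which should give 1. The function names, the step h and the test points are illustrative assumptions.

```python
import math

PI = math.pi


def rate_hebbian(E):
    return -math.sin(PI * E) ** 3 / (PI**2 * math.cos(PI * E))


def rate_perceptron(E):
    return -math.sin(PI * E) ** 3 / (4.0 * PI**2 * E * math.cos(PI * E))


def rate_adatron(E):
    s, c = math.sin(PI * E), math.cos(PI * E)
    return -s**5 / (2.0 * PI**2 * c * (PI * E - c * s))


def t_hebbian(E, E0=0.5):
    return 0.5 * PI / math.sin(PI * E) ** 2 - 0.5 * PI / math.sin(PI * E0) ** 2


def t_perceptron(E, E0=0.5):
    g = lambda e: 2.0 * (PI * e + math.sin(PI * e) * math.cos(PI * e)) / math.sin(PI * e) ** 2
    return g(E) - g(E0)


def t_adatron(E, E0=0.5):
    g = lambda e: (4.0 * PI * e - math.sin(4.0 * PI * e)) / math.sin(PI * e) ** 4
    return PI / 8.0 * (g(E) - g(E0))


def consistency(t_func, rate, E, h=1e-6):
    """(dt/dE) * (dE/dt); equals 1 if t_func solves the rate equation."""
    dt_dE = (t_func(E + h) - t_func(E - h)) / (2.0 * h)
    return dt_dE * rate(E)


if __name__ == "__main__":
    for name, t_func, rate in (("Hebbian", t_hebbian, rate_hebbian),
                               ("perceptron", t_perceptron, rate_perceptron),
                               ("AdaTron", t_adatron, rate_adatron)):
        print(name, [consistency(t_func, rate, E) for E in (0.05, 0.2, 0.4)])
```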



It is quite remarkable that the simple perceptron learning rule, which came out at the
bottom of the league among the three learning rules considered so far in the case of having
constant learning rates, all of a sudden comes out with 'douze points' as soon as we allow for
optimised time-dependent learning rates. It is in addition quite satisfactory that in a number
of cases one can actually find an explicit expression for the relation t(E) between the duration
of the learning stage and the generalization error achieved, i.e. equations (34,50,51,52).

3.4 Optimal On-line Learning Rules

We need not restrict our optimisation attempts to varying the learning rate $\eta$ only, but we can
also vary the full form $\eta\mathcal{F}[J;Jx,y]$ of the learning rule. The aim, as always, is to minimise
the generalisation error, but there will be limits to what is achievable. So far all examples of
on-line learning rules we have studied gave an asymptotic relaxation of the error of the form
$E \sim t^{-q}$ with $q \leq 1$. It can be shown using general probabilistic arguments that if one only
has $p = \alpha N$ examples of randomly drawn question/answer pairs $\{(\boldsymbol{\xi}^\mu, T(\boldsymbol{\xi}^\mu))\}$ with which to
calculate the weight vector J of an N-neuron binary student perceptron (whether in an on-line
or a batch fashion), the generalisation error $E_g(\mathbf{J})$ obeys the inequality $E_g(\mathbf{J}) \geq 0.44\ldots/\alpha$
for $N \to \infty$ (this is the one result we will mention without derivation). For on-line learning
rules of the class (6) or (36,38) we have used at time t a number of examples $p \leq tN$, so this
inequality translates into

$$\lim_{t\to\infty} tE(t) \geq 0.44\ldots \tag{53}$$

No on-line learning rule can violate (53) (see footnote 9). On the other hand: we have already encountered
several rules with at least the optimal power $E \sim t^{-1}$. The optimal on-line learning rule is
thus one which gives asymptotically $E \sim A/t$, but with the smallest value of A possible.

The function $\mathcal{F}[J;Jx,y]$ in the learning rules is allowed to depend only on the sign of the
teacher field $y = \mathbf{B}\cdot\boldsymbol{\xi}$, not on its magnitude, since otherwise it would describe a situation
where considerably more than just the answers $T(\boldsymbol{\xi}) = \mathrm{sgn}[\mathbf{B}\cdot\boldsymbol{\xi}]$ of the teacher are used for
updating the parameters of the student. One can easily see that using unavailable information
indeed violates (53). Suppose, for instance, we would consider spherical on-line rules, i.e. (36)
or (38), and make the forbidden choice

$$\eta\mathcal{F}[1;x,y] = \frac{|y| - \cos(\pi E)\,x\,\mathrm{sgn}(y)}{\cos(\pi E)}$$

We would then find for the corresponding equation (43) describing the evolution of the error
E for $N \to \infty$:

$$\frac{d}{dt}E = -\frac{\big\langle[\,|y| - \cos(\pi E)\,x\,\mathrm{sgn}(y)]^2\big\rangle}{2\pi\sin(\pi E)\cos(\pi E)}$$

(with averages as always calculated with the distribution (14)) from which it follows, upon
using the Gaussian integrals done in the appendix:

$$\frac{d}{dt}E = -\frac{\tan(\pi E)}{2\pi}$$

This produces exponential decay of the error, and thus indeed violates (53).
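The exponential decay can be made explicit with a short worked step (added here as an illustration; $E_0$ denotes the error at t = 0, as elsewhere in this section). Multiplying the equation $\frac{d}{dt}E = -\tan(\pi E)/2\pi$ by $\pi\cos(\pi E)$ turns it into a linear equation for $\sin(\pi E)$:

```latex
\frac{d}{dt}\sin(\pi E) = \pi\cos(\pi E)\,\frac{dE}{dt} = -\tfrac{1}{2}\sin(\pi E)
\quad\Longrightarrow\quad
\sin(\pi E(t)) = \sin(\pi E_0)\,e^{-t/2}
\quad\Longrightarrow\quad
E(t) \simeq \frac{\sin(\pi E_0)}{\pi}\,e^{-t/2} \quad (t\to\infty)
```

Hence $tE(t) \to 0$ for $t \to \infty$, which is incompatible with a strictly positive lower bound on $\lim_{t\to\infty} tE(t)$.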



Taking into account the restrictions on available information, and anticipating the form
subsequent expressions will take, we write the function $\mathcal{F}[J;Jx,y]$ (which we will be varying,
and which we will also allow to have an explicit time-dependence, see footnote 10) in the following form

$$\eta\mathcal{F}[J;Jx,y] = \begin{cases} J\,\mathcal{F}_+(x,t) & \text{if } y > 0 \\ J\,\mathcal{F}_-(x,t) & \text{if } y < 0 \end{cases} \tag{54}$$

Footnote 9: This will be different for graded-response perceptrons.

Footnote 10: By allowing for an explicit time-dependence, we can drop the dependence on J in $\mathcal{F}[J;Jx,y]$ if we wish,
without loss of generality, since J is itself just some function of time.

If our learning rule is of the general form (6), without spherical normalisation, the coupled
equations (41,42) describe the macroscopic dynamics. For the spherical rules (36,38) we have
the single macroscopic equation (43). Both (42) and (43) now acquire the form

$$\frac{d}{dt}E = -\frac{1}{\pi\sin(\pi E)}\Big\{\big\langle(y-\omega x)\,\theta[y]\,\mathcal{F}_+(x,t)\big\rangle - \big\langle(y-\omega x)\,\theta[-y]\,\mathcal{F}_-(x,t)\big\rangle\Big\}$$

$$\qquad + \frac{1}{2\pi\tan(\pi E)}\Big\{\big\langle\theta[y]\,\mathcal{F}_+^2(x,t)\big\rangle + \big\langle\theta[-y]\,\mathcal{F}_-^2(x,t)\big\rangle\Big\} \tag{55}$$

with the usual short-hand $\omega = \cos(\pi E)$ and with averages calculated with the (time-dependent)
distribution (14). To simplify notation we now introduce the two functions

$$\int dy\;\theta[y]\,P(x,y) = \Omega(x,t) \qquad\qquad \int dy\;\theta[y]\,(y-\omega x)\,P(x,y) = \Delta(x,t)$$

and hence, using the symmetry $P(x,y) = P(-x,-y)$, equation (55) acquires the compact
form

$$\frac{d}{dt}E = -\frac{1}{\pi\sin(\pi E)}\int dx\,\Big\{\Delta(x,t)\,\mathcal{F}_+(x,t) - \tfrac{1}{2}\omega\,\Omega(x,t)\,\mathcal{F}_+^2(x,t)\Big\}$$

$$\qquad - \frac{1}{\pi\sin(\pi E)}\int dx\,\Big\{\Delta(-x,t)\,\mathcal{F}_-(x,t) - \tfrac{1}{2}\omega\,\Omega(-x,t)\,\mathcal{F}_-^2(x,t)\Big\} \tag{56}$$

Since there is only one dynamical variable, the error E, our optimisation problem is solved
by the 'greedy' recipe which here involves functional derivatives:



$$\forall x,\ \forall t: \qquad \frac{\delta}{\delta\mathcal{F}_+(x,t)}\Big[\frac{dE}{dt}\Big] = \frac{\delta}{\delta\mathcal{F}_-(x,t)}\Big[\frac{dE}{dt}\Big] = 0$$

with the solution

$$\mathcal{F}_+(x,t) = \frac{\Delta(x,t)}{\omega\,\Omega(x,t)} \qquad\qquad \mathcal{F}_-(x,t) = \frac{\Delta(-x,t)}{\omega\,\Omega(-x,t)} = \mathcal{F}_+(-x,t)$$

Substitution of this solution into (56) gives the corresponding law describing the optimal
error evolution of (ordinary and spherical) on-line rules:

$$\frac{dE}{dt}\Big|_{\rm opt} = -\frac{1}{\pi\sin(\pi E)\cos(\pi E)}\int dx\;\frac{\Delta^2(x,t)}{\Omega(x,t)}$$

Explicit calculation of the integrals $\Delta(x,t)$ and $\Omega(x,t)$ (see appendix) gives:

$$\Delta(x,t) = \frac{\sin(\pi E)}{2\pi}\,e^{-\frac{1}{2}x^2/\sin^2(\pi E)} \qquad\qquad \Omega(x,t) = \frac{e^{-\frac{1}{2}x^2}}{2\sqrt{2\pi}}\Big[1 + \mathrm{erf}\Big(\frac{x}{\sqrt{2}\tan(\pi E)}\Big)\Big]$$

with which we finally obtain an explicit expression for the optimal form of the learning rule,
via (54), as well as for the dynamical law describing the corresponding error evolution:

$$\eta\mathcal{F}[J;Jx,y]_{\rm opt} = \sqrt{\frac{2}{\pi}}\;\frac{J\tan(\pi E)\;e^{-\frac{1}{2}x^2/\tan^2(\pi E)}}{1 + \mathrm{sgn}(xy)\,\mathrm{erf}\big(|x|/\sqrt{2}\tan(\pi E)\big)} \tag{57}$$

Figure 11: Evolution of the error E for three on-line learning rules: Perceptron rule with
a learning rate such that J(t) = 1 for all $t \geq 0$ (solid line), Perceptron rule with optimal
learning rate (dashed line) and the optimal spherical learning rule (dotted line). Initial state:
E(0) = 1/2 and J(0) = 1. The curves for the Perceptron rules are given by (34) and (51). The
curve for the optimal spherical rule was obtained by numerical solution of equation (58).



$$\frac{dE}{dt}\Big|_{\rm opt} = -\frac{\tan^2(\pi E)}{\pi^2\sqrt{2\pi}}\int dx\;\frac{e^{-\frac{1}{2}x^2[1+\cos^2(\pi E)]/\cos^2(\pi E)}}{1 + \mathrm{erf}(x/\sqrt{2})} \tag{58}$$

The asymptotic form of the error relaxation towards the E = 0 state follows from expansion
of equation (58) for small E, which gives

$$\frac{dE}{dt} = -E^2\int dx\;\frac{e^{-x^2}}{\sqrt{2\pi}\,[1 + \mathrm{erf}(x/\sqrt{2})]} + O(E^4)$$

so that we can conclude that the optimum asymptotic decay for on-line learning rules (whether
spherical or non-spherical) is given by $E \sim A/t$ for $t \to \infty$, with

$$A^{-1} = \int dx\;\frac{e^{-x^2}}{\sqrt{2\pi}\,[1 + \mathrm{erf}(x/\sqrt{2})]}$$

Numerical evaluation of this integral (which is somewhat delicate due to the behaviour of the
integrand for $x \to -\infty$) finally gives

$$E \sim \frac{0.883\ldots}{t} \qquad (t\to\infty)$$
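The numerical evaluation of $A$ can be sketched as follows. The delicate point is the region of large negative x, where $1 + \mathrm{erf}(x/\sqrt{2})$ suffers from cancellation; writing it as $\mathrm{erfc}(-x/\sqrt{2})$ avoids this. The integration limits, the number of Simpson intervals and the function name are assumptions made for this illustration.

```python
import math


def optimal_prefactor(lo=-12.0, hi=8.0, n=4000):
    """Simpson estimate of A, where 1/A is the integral over x of
    exp(-x^2) / (sqrt(2 pi) [1 + erf(x / sqrt 2)])."""
    def integrand(x):
        # 1 + erf(z) = erfc(-z): numerically stable for large negative x
        return math.exp(-x * x) / (math.sqrt(2.0 * math.pi) * math.erfc(-x / math.sqrt(2.0)))

    h = (hi - lo) / n          # n must be even for Simpson's rule
    total = integrand(lo) + integrand(hi)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * integrand(lo + k * h)
    return 1.0 / (total * h / 3.0)


if __name__ == "__main__":
    print("A =", optimal_prefactor())
```

The contribution from outside the chosen interval is negligible, since the integrand decays like $\frac{1}{2}|x|e^{-x^2/2}$ for $x \to -\infty$ and like $e^{-x^2}$ for $x \to \infty$.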

Figure 12: Evolution of the error E for the on-line Perceptron rule with a learning rate such
that J(t) = 1 for all $t \geq 0$ (solid line), the on-line Perceptron rule with optimal learning
rate (dashed line) and the optimal spherical on-line learning rule (dotted line). Initial states:
(J, E) = (1, 1/2) (upper curves), and (J, E) = (1, $E_0$) with a smaller initial error $E_0$ (lower curves). The curves for the
Perceptron rules are given by (34) and (51). The curves for the optimal spherical rule were
obtained by numerical solution of equation (58).

It is instructive to investigate briefly the form of the optimal learning rule (57) for large values
of E (as in the initial stages of learning processes) and for small values of E (as in the final
stages of learning processes). Initially we find

$$\lim_{E\uparrow\frac{1}{2}}\;\frac{\eta\mathcal{F}[J;Jx,y]_{\rm opt}}{\tan(\pi E)} = J\sqrt{\frac{2}{\pi}}$$

which describes a Hebbian-type learning rule with diverging learning rate (note that
$\tan(\pi E) \to \infty$ for $E \uparrow \frac{1}{2}$). In contrast, in the final stages the optimal learning rule (57) acquires the form

$$\lim_{E\downarrow 0}\;\eta\mathcal{F}[J;Jx,y]_{\rm opt} = \frac{J|x|}{\sqrt{\pi}}\,\theta[-xy]\,\lim_{z\to\infty}\frac{e^{-z^2}}{z\,[1-\mathrm{erf}(z)]} = J|x|\,\theta[-xy]$$

which is the AdaTron learning rule with learning rate $\eta = 1$ (see footnote 11).



In figures 11 (short times and ordinary axes) and 12 (large times and log-log axes) we
finally compare the evolution of the error for the optimal on-line learning rule (57) with
the two on-line learning rules which so far were found to give the fastest relaxation: the
perceptron rule with normalising time-dependent learning rate (giving the error of (34)), and
the perceptron rule with optimal time-dependent learning rate (giving the error of (51)).
This is done in order to assess whether choosing the optimal on-line learning rule (57) rather than
its simpler competitors is actually worth the effort. The curves for the optimal on-line rule
were obtained by numerical solution of equation (58).

Footnote 11: The reason that, in spite of the asymptotic equivalence of the two rules, the optimal rule does not
asymptotically give the same relaxation of the error E as the AdaTron rule is that in order to determine the
asymptotics one has to take the limit $E \to 0$ in the full macroscopic differential equation for E, which, in
addition to the function $\mathcal{F}[\ldots]$ defining the learning rule, involves the Gaussian probability distribution (14)
which depends on E in a non-trivial way, especially near E = 0.



3.5 Summary in a Table 

We close this section with an overview of some of the results on on-line learning in perceptrons
described/derived so far. The upper part of this table contains results for specific learning
rules with either arbitrary constant learning rates $\eta$ (first column), optimal constant learning
rate $\eta$ (second column), and where possible, a time-dependent learning rate $\eta(t)$ chosen such
as to realise the normalisation J(t) = 1 for all t (third column). The lower part of the table gives results
for specific learning rules with optimised time-dependent learning rates $\eta(t)$, as well as lower
bounds on the asymptotic generalization error.



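GENERALIZATION ERROR IN PERCEPTRONS WITH ON-LINE LEARNING RULES

| Rule | Constant $\eta$: asymptotic decay | Constant $\eta$: optimal asymptotic decay | Variable $\eta$, chosen to normalise $J$ |
|------|------|------|------|
| Hebbian | $E \sim \frac{1}{\sqrt{2\pi}}\,t^{-1/2}$ for $\eta > 0$ | $E \sim \frac{1}{\sqrt{2\pi}}\,t^{-1/2}$ for $\eta > 0$ | N/A |
| Perceptron | $E \sim (\frac{2}{3})^{1/3}\pi^{-1}t^{-1/3}$ for $\eta > 0$ | $E \sim (\frac{2}{3})^{1/3}\pi^{-1}t^{-1/3}$ for $\eta > 0$ | $E \sim \frac{4}{\pi}\,t^{-1}$ |
| AdaTron | $E \sim \frac{3}{3\eta-\eta^2}\,t^{-1}$ for $0 < \eta < 3$ | $E \sim \frac{4}{3}\,t^{-1}$ for $\eta = \frac{3}{2}$ | $E \sim \frac{3}{2}\,t^{-1}$ |

OPTIMAL GENERALIZATION

| Rule | Generalization error for optimal time-dependent $\eta$ | Asymptotics |
|------|------|------|
| Hebbian | $t = \frac{\pi}{2}\left[\frac{1}{\sin^2(\pi E)} - \frac{1}{\sin^2(\pi E_0)}\right]$ | $E \sim \frac{1}{\sqrt{2\pi}}\,t^{-1/2}$ |
| Perceptron | $t = 2\left[\frac{\pi E+\sin(\pi E)\cos(\pi E)}{\sin^2(\pi E)} - \frac{\pi E_0+\sin(\pi E_0)\cos(\pi E_0)}{\sin^2(\pi E_0)}\right]$ | $E \sim \frac{4}{\pi}\,t^{-1}$ |
| AdaTron | $t = \frac{\pi}{8}\left[\frac{4\pi E-\sin(4\pi E)}{\sin^4(\pi E)} - \frac{4\pi E_0-\sin(4\pi E_0)}{\sin^4(\pi E_0)}\right]$ | $E \sim \frac{4}{3}\,t^{-1}$ |

Lower bound for on-line learning (asymptotics of the optimal learning rule): $E \sim 0.88\,t^{-1}$

Lower bound for any learning rule: $E \sim 0.44\,t^{-1}$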
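
The asymptotic entries in the lower part of the table can be checked numerically against the implicit solutions $t(E)$. The following is only a sketch, not part of the original analysis: it assumes the three $t(E)$ expressions in the form $t = \frac{\pi}{2}[\ldots]$ (Hebbian), $t = 2[\ldots]$ (perceptron) and $t = \frac{\pi}{8}[\ldots]$ (AdaTron) quoted in the table, an initial error $E_0 = 1/2$, and function names that are purely illustrative.

```python
import math

def t_hebbian_opt(E, E0=0.5):
    """Time needed to reach error E (Hebbian rule, optimal eta(t)); assumed form."""
    f = lambda e: 1.0 / math.sin(math.pi * e) ** 2
    return 0.5 * math.pi * (f(E) - f(E0))

def t_perceptron_opt(E, E0=0.5):
    """Time needed to reach error E (perceptron rule, optimal eta(t)); assumed form."""
    g = lambda e: (math.pi * e + math.sin(math.pi * e) * math.cos(math.pi * e)) \
        / math.sin(math.pi * e) ** 2
    return 2.0 * (g(E) - g(E0))

def t_adatron_opt(E, E0=0.5):
    """Time needed to reach error E (AdaTron rule, optimal eta(t)); assumed form."""
    g = lambda e: (4 * math.pi * e - math.sin(4 * math.pi * e)) \
        / math.sin(math.pi * e) ** 4
    return (math.pi / 8.0) * (g(E) - g(E0))

E = 1e-3  # small error, i.e. deep in the asymptotic regime
hebbian_prefactor = E * math.sqrt(t_hebbian_opt(E))   # compare with 1/sqrt(2*pi)
perceptron_prefactor = E * t_perceptron_opt(E)        # compare with 4/pi
adatron_prefactor = E * t_adatron_opt(E)              # compare with 4/3
print(hebbian_prefactor, perceptron_prefactor, adatron_prefactor)
```

For small $E$ the three products approach the prefactors of the power laws listed in the 'Asymptotics' column, up to corrections of order $E$.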
4 The Formal Approach



The main reason for developing a more formal approach to learning dynamics is that in the complicated cases of incomplete training sets or layered systems with large numbers of hidden neurons we can no longer get away with the relatively simple methods used so far. For perceptrons with $N$ inputs the situation of incomplete training sets arises when the number of 'questions' scales as $|\tilde{D}| = \alpha N$. We show how in the limit $N\to\infty$ the dynamics of any finite set of mean-field observables will be described by a (macroscopic) Fokker-Planck equation, of which the flow and diffusion terms can be calculated explicitly. In addition the more formal analysis will allow us to recover the previous results on on-line learning in a more rigorous way, and will clarify the relation between the macroscopic laws for the on-line and batch scenarios.

4.1 From Discrete to Continuous Times 

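We will describe the formal procedure for calculating macroscopic dynamical laws from the microscopic ones for on-line and batch learning processes in simple perceptrons. It involves several distinct stages. Our starting point is the formulation (2) in terms of a Markov process:

$$ p_{m+1}(\boldsymbol{J}) = \int\! d\boldsymbol{J}'\; W[\boldsymbol{J};\boldsymbol{J}']\,p_m(\boldsymbol{J}') \qquad (59) $$

with transition probability densities corresponding to the class of generic learning rules considered so far (see the footnote below):

$$ \text{On-line:}\quad W[\boldsymbol{J};\boldsymbol{J}'] = \Big\langle \delta\Big[\boldsymbol{J}-\boldsymbol{J}'-\frac{\eta}{N}\,\boldsymbol{\xi}\,{\rm sgn}(\boldsymbol{B}\cdot\boldsymbol{\xi})\,\mathcal{F}[|\boldsymbol{J}'|;\boldsymbol{J}'\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]\Big]\Big\rangle_{\tilde{D}} $$

$$ \text{Batch:}\quad W[\boldsymbol{J};\boldsymbol{J}'] = \delta\Big[\boldsymbol{J}-\boldsymbol{J}'-\frac{\eta}{N}\Big\langle\boldsymbol{\xi}\,{\rm sgn}(\boldsymbol{B}\cdot\boldsymbol{\xi})\,\mathcal{F}[|\boldsymbol{J}'|;\boldsymbol{J}'\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]\Big\rangle_{\tilde{D}}\Big] \qquad (60) $$

Note that in the previous approach the limit $N\to\infty$ realised several simplifications at once (continuous versus discrete time, deterministic versus stochastic macroscopic evolution), which for technical reasons we would prefer to control independently.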



We will first describe a method to make the transition from the discrete-time process (59) to a description involving real-valued times in a more transparent and exact way. The idea is to choose the duration of each discrete iteration step in the process (59) to be a real-valued random number, such that the probability that at time $t$ precisely $m$ steps have been made is given by the Poisson expression

$$ \pi_m(t) = \frac{1}{m!}\,(Nt)^m\, e^{-Nt} $$

with the properties

$$ \frac{d}{dt}\pi_{m>0}(t) = N\left[\pi_{m-1}(t)-\pi_m(t)\right] \qquad \frac{d}{dt}\pi_0(t) = -N\pi_0(t) \qquad (61) $$

$$ \langle m\rangle = Nt \qquad \langle m^2\rangle-\langle m\rangle^2 = Nt \qquad (62) $$

Footnote: As before this is just one choice of many. We could e.g. easily add a term of the form $\boldsymbol{J}'\mathcal{K}[|\boldsymbol{J}'|;\boldsymbol{J}'\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]$ to account for weight decay (constant, 'hard' spherical, 'soft' spherical, or otherwise), without making the analysis significantly more difficult.
This move at first sight appears to make the problem more complicated, but will turn out to do precisely the opposite. From (62) it follows that for times $t \ll N$ one has $t = m/N + O(N^{-1/2})$, the usual time unit. Due to the random durations of the iteration steps we also have to replace the microscopic probability distribution $p_m(\boldsymbol{J})$ in (59) by one that takes the variations in numbers of iteration steps performed at a given time $t$ into account:

$$ p_t(\boldsymbol{J}) = \sum_{m\geq 0}\pi_m(t)\,p_m(\boldsymbol{J}) \qquad (63) $$

This distribution obeys a simple differential equation, which follows from combining the equations (59), (61) and (63):

$$ \frac{d}{dt}p_t(\boldsymbol{J}) = N\int\! d\boldsymbol{J}'\,\left\{W[\boldsymbol{J};\boldsymbol{J}']-\delta[\boldsymbol{J}-\boldsymbol{J}']\right\}p_t(\boldsymbol{J}') \qquad (64) $$

So far no approximations have been made; equation (64), which replaces (59), is exact for any $N$. We have made the transition from discrete-time iterations to differential equations (which are usually much easier to handle) without invoking the limit $N\to\infty$, but at the price of an uncertainty in where we are on the time axis. This uncertainty, however, is guaranteed to vanish in the limit $N\to\infty$.
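
The exactness of the step from (59) to (64) can be illustrated on a toy problem. The sketch below is an illustration under stated assumptions, not the authors' procedure: the continuous weight space is replaced by a hypothetical four-state Markov chain with a random column-stochastic matrix `W`, the Poisson-weighted mixture $\sum_m \pi_m(t)\,W^m p_0$ is evaluated directly, and the result is compared with a Runge-Kutta integration of $\frac{d}{dt}p_t = N(W-1)p_t$. All names are illustrative.

```python
import math
import numpy as np

def poisson_mixture(W, p0, N, t, m_max=400):
    """p_t = sum_m pi_m(t) W^m p0, with pi_m(t) = (Nt)^m exp(-Nt) / m!."""
    p_m = p0.copy()
    weight = math.exp(-N * t)            # pi_0(t)
    p_t = weight * p_m
    for m in range(1, m_max + 1):
        p_m = W @ p_m                    # one discrete iteration step, as in (59)
        weight *= N * t / m              # pi_m(t) from pi_{m-1}(t)
        p_t = p_t + weight * p_m
    return p_t

def integrate_master_equation(W, p0, N, t, steps=5000):
    """Fourth-order Runge-Kutta integration of dp/dt = N (W - 1) p."""
    A = N * (W - np.eye(len(p0)))
    p, h = p0.copy(), t / steps
    for _ in range(steps):
        k1 = A @ p
        k2 = A @ (p + 0.5 * h * k1)
        k3 = A @ (p + 0.5 * h * k2)
        k4 = A @ (p + h * k3)
        p = p + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return p

rng = np.random.default_rng(0)
W = rng.random((4, 4))
W /= W.sum(axis=0)                       # columns sum to one: W[i, j] = Prob(j -> i)
p0 = np.array([1.0, 0.0, 0.0, 0.0])
N, t = 10, 1.5
p_poisson = poisson_mixture(W, p0, N, t)
p_ode = integrate_master_equation(W, p0, N, t)
print(np.max(np.abs(p_poisson - p_ode)))
```

The two distributions agree to within the accuracy of the integrator for this finite $N$, which is the content of the statement that no approximation is involved at this stage.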

4.2 From Microscopic to Macroscopic Laws 

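We next wish to investigate the dynamics of a number of as yet arbitrary macroscopic observables $\boldsymbol{\Omega}[\boldsymbol{J}] = (\Omega_1[\boldsymbol{J}],\ldots,\Omega_k[\boldsymbol{J}])$. They are assumed to be $O(1)$ each for $N\to\infty$, and finite in number. To do so we introduce the associated macroscopic probability distribution

$$ P_t(\boldsymbol{\Omega}) = \int\! d\boldsymbol{J}\; p_t(\boldsymbol{J})\,\delta\left[\boldsymbol{\Omega}-\boldsymbol{\Omega}[\boldsymbol{J}]\right] \qquad (65) $$

Its time derivative immediately follows from that in (64):

$$ \frac{d}{dt}P_t(\boldsymbol{\Omega}) = N\int\! d\boldsymbol{J}\,d\boldsymbol{J}'\;\delta\left[\boldsymbol{\Omega}-\boldsymbol{\Omega}[\boldsymbol{J}]\right]\left\{W[\boldsymbol{J};\boldsymbol{J}']-\delta[\boldsymbol{J}-\boldsymbol{J}']\right\}p_t(\boldsymbol{J}') $$

This equation can be written in the standard form

$$ \frac{d}{dt}P_t(\boldsymbol{\Omega}) = \int\! d\boldsymbol{\Omega}'\;\mathcal{W}_t[\boldsymbol{\Omega};\boldsymbol{\Omega}']\,P_t(\boldsymbol{\Omega}') \qquad (66) $$

where

$$ \mathcal{W}_t[\boldsymbol{\Omega};\boldsymbol{\Omega}'] = \frac{\int\! d\boldsymbol{J}'\,p_t(\boldsymbol{J}')\,\delta\left[\boldsymbol{\Omega}'-\boldsymbol{\Omega}[\boldsymbol{J}']\right]\int\! d\boldsymbol{J}\;\delta\left[\boldsymbol{\Omega}-\boldsymbol{\Omega}[\boldsymbol{J}]\right] N\left\{W[\boldsymbol{J};\boldsymbol{J}']-\delta[\boldsymbol{J}-\boldsymbol{J}']\right\}}{\int\! d\boldsymbol{J}'\,p_t(\boldsymbol{J}')\,\delta\left[\boldsymbol{\Omega}'-\boldsymbol{\Omega}[\boldsymbol{J}']\right]} $$

(this statement can be verified by substitution of $\mathcal{W}_t[\boldsymbol{\Omega};\boldsymbol{\Omega}']$ into (66)). Note that the macroscopic process (66) need not be Markovian, due to the explicit time-dependence of the macroscopic transition density $\mathcal{W}_t[\boldsymbol{\Omega};\boldsymbol{\Omega}']$. If we now insert the relevant expressions (60) for $W[\boldsymbol{J};\boldsymbol{J}']$, we can perform the $\boldsymbol{J}$-integrations, and obtain an expression in terms of so-called sub-shell averages (or conditional averages) $\langle f(\boldsymbol{J})\rangle_{\boldsymbol{\Omega};t}$, which are defined as

$$ \langle f(\boldsymbol{J})\rangle_{\boldsymbol{\Omega};t} = \frac{\int\! d\boldsymbol{J}\;p_t(\boldsymbol{J})\,\delta\left[\boldsymbol{\Omega}-\boldsymbol{\Omega}[\boldsymbol{J}]\right]f(\boldsymbol{J})}{\int\! d\boldsymbol{J}\;p_t(\boldsymbol{J})\,\delta\left[\boldsymbol{\Omega}-\boldsymbol{\Omega}[\boldsymbol{J}]\right]} $$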
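For the two types of learning rules at hand (on-line and batch) we obtain:

$$ \mathcal{W}^{\rm onl}_t[\boldsymbol{\Omega};\boldsymbol{\Omega}'] = N\Big\langle\Big\langle \delta\Big[\boldsymbol{\Omega}-\boldsymbol{\Omega}\big[\boldsymbol{J}+\tfrac{\eta}{N}\boldsymbol{\xi}\,{\rm sgn}(\boldsymbol{B}\cdot\boldsymbol{\xi})\mathcal{F}[|\boldsymbol{J}|;\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]\big]\Big]-\delta\big[\boldsymbol{\Omega}-\boldsymbol{\Omega}[\boldsymbol{J}]\big]\Big\rangle_{\tilde{D}}\Big\rangle_{\boldsymbol{\Omega}';t} $$

$$ \mathcal{W}^{\rm bat}_t[\boldsymbol{\Omega};\boldsymbol{\Omega}'] = N\Big\langle \delta\Big[\boldsymbol{\Omega}-\boldsymbol{\Omega}\big[\boldsymbol{J}+\tfrac{\eta}{N}\big\langle\boldsymbol{\xi}\,{\rm sgn}(\boldsymbol{B}\cdot\boldsymbol{\xi})\mathcal{F}[|\boldsymbol{J}|;\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]\big\rangle_{\tilde{D}}\big]\Big]-\delta\big[\boldsymbol{\Omega}-\boldsymbol{\Omega}[\boldsymbol{J}]\big]\Big\rangle_{\boldsymbol{\Omega}';t} $$

We now insert integral representations for the $\delta$-distributions,

$$ \delta[\boldsymbol{\Omega}-\boldsymbol{Q}] = \int\!\frac{d\hat{\boldsymbol{\Omega}}}{(2\pi)^k}\;e^{i\hat{\boldsymbol{\Omega}}\cdot[\boldsymbol{\Omega}-\boldsymbol{Q}]} $$

which gives for our two learning scenarios:

$$ \mathcal{W}^{\rm onl}_t[\boldsymbol{\Omega};\boldsymbol{\Omega}'] = \int\!\frac{d\hat{\boldsymbol{\Omega}}}{(2\pi)^k}\,e^{i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}}\,N\Big\langle\Big\langle e^{-i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\boldsymbol{J}+\frac{\eta}{N}\boldsymbol{\xi}\,{\rm sgn}(\boldsymbol{B}\cdot\boldsymbol{\xi})\mathcal{F}[|\boldsymbol{J}|;\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]]}-e^{-i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\boldsymbol{J}]}\Big\rangle_{\tilde{D}}\Big\rangle_{\boldsymbol{\Omega}';t} \qquad (67) $$

$$ \mathcal{W}^{\rm bat}_t[\boldsymbol{\Omega};\boldsymbol{\Omega}'] = \int\!\frac{d\hat{\boldsymbol{\Omega}}}{(2\pi)^k}\,e^{i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}}\,N\Big\langle e^{-i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\boldsymbol{J}+\frac{\eta}{N}\langle\boldsymbol{\xi}\,{\rm sgn}(\boldsymbol{B}\cdot\boldsymbol{\xi})\mathcal{F}[|\boldsymbol{J}|;\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]\rangle_{\tilde{D}}]}-e^{-i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\boldsymbol{J}]}\Big\rangle_{\boldsymbol{\Omega}';t} \qquad (68) $$

Still no approximations have been made. The above two expressions differ only in the stage at which the averaging over the training set $\tilde{D}$ occurs.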



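Our aim is to obtain from (66) an autonomous set of macroscopic dynamic equations, i.e. we want to choose the observables $\boldsymbol{\Omega}[\boldsymbol{J}]$ such that for $N\to\infty$ the explicit time-dependence in $\mathcal{W}_t[\boldsymbol{\Omega};\boldsymbol{\Omega}']$, induced by the appearance of the microscopic distribution $p_t(\boldsymbol{J})$, will vanish. This can happen either because $p_t(\boldsymbol{J})$ drops out, or because $p_t(\boldsymbol{J})$ depends on $\boldsymbol{J}$ only via $\boldsymbol{\Omega}[\boldsymbol{J}]$, or even through combinations of these mechanisms. In expanding equations (67,68) for large $N$ we have to be somewhat careful, since the system size $N$ not only enters as a small parameter that controls the magnitude of the modification of individual components of the weight vector $\boldsymbol{J}$, but also determines the dimensions and lengths of various vectors. Upon inspection of the general Taylor expansion

$$ \Omega[\boldsymbol{J}+\boldsymbol{k}] = \sum_{\ell\geq 0}\frac{1}{\ell!}\sum_{i_1=1}^{N}\cdots\sum_{i_\ell=1}^{N} k_{i_1}\cdots k_{i_\ell}\,\frac{\partial^\ell\Omega[\boldsymbol{J}]}{\partial J_{i_1}\cdots\partial J_{i_\ell}} $$

we see that if all derivatives were to be treated as $O(1)$ (i.e. if we only take into account the dependence of the components of $\boldsymbol{k}$ on $N$), we end up in trouble, since in the cases of interest (where $k_i = O(N^{-1})$) this series could give $\Omega[\boldsymbol{J}+\boldsymbol{k}] = \sum_{\ell\geq 0}O((\sum_i k_i)^\ell) = \sum_{\ell\geq 0}O(1)$. We need to restrict ourselves to observables $\Omega_\mu[\boldsymbol{J}]$ of the mean-field type, where all components $J_i$ play an equivalent role in determining the overall scaling with respect to $N$ (which makes sense). For instance:

$$ \Omega[\boldsymbol{J}] = \sum_k B_kJ_k:\qquad O(\partial_i\Omega[\boldsymbol{J}]) = O(B_i) = N^{-1}\,O(\Omega[\boldsymbol{J}])/O(J_i) $$

$$ \Omega[\boldsymbol{J}] = \sum_k J_k^2:\qquad O(\partial_i\Omega[\boldsymbol{J}]) = O(J_i) = N^{-1}\,O(\Omega[\boldsymbol{J}])/O(J_i) $$

$$ \Omega[\boldsymbol{J}] = \sum_{kl}J_kA_{kl}J_l:\qquad O(\partial_i\Omega[\boldsymbol{J}]) = O(\textstyle\sum_k A_{ik}J_k) = N^{-1}\,O(\Omega[\boldsymbol{J}])/O(J_i) $$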
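The pattern is clear. The only additional point to be taken into account is that in the case of multiple derivatives with respect to the same component $J_i$, our scaling requirement will be less severe due to the fact that such terms occur less frequently than multiple derivatives with respect to different components (i.e. in $\sum_{ij}J_iA_{ij}J_j$ we have $N(N-1)$ terms with $i\neq j$, but just $N$ with $i=j$). We thus define mean-field observables $\Omega[\boldsymbol{J}]$ as

$$ \text{mean-field observables:}\qquad \frac{\partial^\ell\Omega[\boldsymbol{J}]}{\partial J_{i_1}\cdots\partial J_{i_\ell}} = O\big(|\boldsymbol{J}|^{-\ell}N^{\frac{\ell}{2}-d}\big)\qquad (N\to\infty) \qquad (69) $$

in which $d$ is the number of different elements in the set $\{i_1,\ldots,i_\ell\}$. For mean-field observables we can estimate the scaling of the various terms in the Taylor expansion:

$$ \Omega[\boldsymbol{J}+\boldsymbol{k}] = \Omega[\boldsymbol{J}]+\sum_i k_i\frac{\partial\Omega[\boldsymbol{J}]}{\partial J_i}+\frac{1}{2}\sum_{ij}k_ik_j\frac{\partial^2\Omega[\boldsymbol{J}]}{\partial J_i\partial J_j}+\sum_{\ell\geq 3}O\Big(\Big[\frac{|\boldsymbol{k}|}{|\boldsymbol{J}|}\Big]^\ell\Big) \qquad (70) $$

(where we have used $\sum_i k_i = O(\sqrt{N}\,|\boldsymbol{k}|)$).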

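We now apply (70) to our equations (67,68), restricting ourselves henceforth to mean-field observables $\Omega_\mu[\boldsymbol{J}]$ in the sense of (69). The shifts $\boldsymbol{k}$, being either $\frac{\eta}{N}\boldsymbol{\xi}\,{\rm sgn}(\boldsymbol{B}\cdot\boldsymbol{\xi})\mathcal{F}[|\boldsymbol{J}|;\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]$ or $\frac{\eta}{N}\langle\boldsymbol{\xi}\,{\rm sgn}(\boldsymbol{B}\cdot\boldsymbol{\xi})\mathcal{F}[|\boldsymbol{J}|;\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]\rangle_{\tilde{D}}$, scale as $|\boldsymbol{k}| = O(N^{-\frac{1}{2}})$. Furthermore, if we choose one of our observables to be $\Omega_1[\boldsymbol{J}] = \boldsymbol{J}^2$, the subshells in (67,68) will ensure $|\boldsymbol{J}| = O(1)$, so that the $\ell$-th order term in the expansions (70) will be of order $N^{-\frac{\ell}{2}}$ in both cases. This allows us to expand:

$$ e^{-i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\boldsymbol{J}+\boldsymbol{k}]}-e^{-i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\boldsymbol{J}]} = e^{-i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\boldsymbol{J}]}\Big\{-i\sum_i k_i\frac{\partial}{\partial J_i}(\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\boldsymbol{J}])-\frac{i}{2}\sum_{ij}k_ik_j\frac{\partial^2}{\partial J_i\partial J_j}(\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\boldsymbol{J}])-\frac{1}{2}\Big[\sum_i k_i\frac{\partial}{\partial J_i}(\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\boldsymbol{J}])\Big]^2+O(N^{-\frac{3}{2}})\Big\} $$

so that

$$ N\int\!\frac{d\hat{\boldsymbol{\Omega}}}{(2\pi)^k}\,e^{i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}}\Big[e^{-i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\boldsymbol{J}+\boldsymbol{k}]}-e^{-i\hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\boldsymbol{J}]}\Big] = \Big\{-N\sum_\mu\frac{\partial}{\partial\Omega_\mu}\Big[\sum_i k_i\frac{\partial\Omega_\mu[\boldsymbol{J}]}{\partial J_i}+\frac{1}{2}\sum_{ij}k_ik_j\frac{\partial^2\Omega_\mu[\boldsymbol{J}]}{\partial J_i\partial J_j}\Big]+\frac{N}{2}\sum_{\mu\nu}\frac{\partial^2}{\partial\Omega_\mu\partial\Omega_\nu}\Big[\sum_i k_i\frac{\partial\Omega_\mu[\boldsymbol{J}]}{\partial J_i}\Big]\Big[\sum_j k_j\frac{\partial\Omega_\nu[\boldsymbol{J}]}{\partial J_j}\Big]\Big\}\;\delta\big[\boldsymbol{\Omega}-\boldsymbol{\Omega}[\boldsymbol{J}]\big]+O(N^{-\frac{1}{2}}) $$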
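We now find, upon insertion of this expansion into the expressions (67) and (68), that both types of learning dynamics (on-line and batch) are described by macroscopic laws with transition probabilities of the general form

$$ \mathcal{W}_t[\boldsymbol{\Omega};\boldsymbol{\Omega}'] = \Big\{-\sum_\mu F_\mu[\boldsymbol{\Omega}';t]\frac{\partial}{\partial\Omega_\mu}+\frac{1}{2}\sum_{\mu\nu}G_{\mu\nu}[\boldsymbol{\Omega}';t]\frac{\partial^2}{\partial\Omega_\mu\partial\Omega_\nu}\Big\}\,\delta[\boldsymbol{\Omega}-\boldsymbol{\Omega}'] $$

which, in combination with the dynamic equation (66), leads to a convenient and transparent description of the macroscopic dynamics in the form of a Fokker-Planck equation:

$$ \frac{d}{dt}P_t(\boldsymbol{\Omega}) = -\sum_{\mu=1}^{k}\frac{\partial}{\partial\Omega_\mu}\big\{F_\mu[\boldsymbol{\Omega};t]P_t(\boldsymbol{\Omega})\big\}+\frac{1}{2}\sum_{\mu\nu=1}^{k}\frac{\partial^2}{\partial\Omega_\mu\partial\Omega_\nu}\big\{G_{\mu\nu}[\boldsymbol{\Omega};t]P_t(\boldsymbol{\Omega})\big\} \qquad (71) $$

(modulo contributions which vanish for $N\to\infty$). The differences between on-line and batch learning are in the explicit expressions for the functions $F_\mu[\boldsymbol{\Omega};t]$ and $G_{\mu\nu}[\boldsymbol{\Omega};t]$ in the flow and diffusion terms. Upon introducing the short-hand $\mathcal{F}[\ldots]$ for $\mathcal{F}[|\boldsymbol{J}|;\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]$ these can be written as:

$$ F^{\rm onl}_\mu[\boldsymbol{\Omega};t] = \eta\Big\langle\Big\langle\sum_i\xi_i\,{\rm sgn}(\boldsymbol{B}\cdot\boldsymbol{\xi})\mathcal{F}[\ldots]\frac{\partial\Omega_\mu[\boldsymbol{J}]}{\partial J_i}\Big\rangle_{\tilde{D}}\Big\rangle_{\boldsymbol{\Omega};t}+\frac{\eta^2}{2N}\Big\langle\Big\langle\sum_{ij}\xi_i\xi_j\mathcal{F}^2[\ldots]\frac{\partial^2\Omega_\mu[\boldsymbol{J}]}{\partial J_i\partial J_j}\Big\rangle_{\tilde{D}}\Big\rangle_{\boldsymbol{\Omega};t} \qquad (72) $$

$$ G^{\rm onl}_{\mu\nu}[\boldsymbol{\Omega};t] = \frac{\eta^2}{N}\Big\langle\Big\langle\Big[\sum_i\xi_i\,{\rm sgn}(\boldsymbol{B}\cdot\boldsymbol{\xi})\mathcal{F}[\ldots]\frac{\partial\Omega_\mu[\boldsymbol{J}]}{\partial J_i}\Big]\Big[\sum_j\xi_j\,{\rm sgn}(\boldsymbol{B}\cdot\boldsymbol{\xi})\mathcal{F}[\ldots]\frac{\partial\Omega_\nu[\boldsymbol{J}]}{\partial J_j}\Big]\Big\rangle_{\tilde{D}}\Big\rangle_{\boldsymbol{\Omega};t} \qquad (73) $$

$$ F^{\rm bat}_\mu[\boldsymbol{\Omega};t] = \eta\Big\langle\sum_i\big\langle\xi_i\,{\rm sgn}(\boldsymbol{B}\cdot\boldsymbol{\xi})\mathcal{F}[\ldots]\big\rangle_{\tilde{D}}\frac{\partial\Omega_\mu[\boldsymbol{J}]}{\partial J_i}\Big\rangle_{\boldsymbol{\Omega};t}+\frac{\eta^2}{2N}\Big\langle\sum_{ij}\big\langle\xi_i\,{\rm sgn}(\boldsymbol{B}\cdot\boldsymbol{\xi})\mathcal{F}[\ldots]\big\rangle_{\tilde{D}}\big\langle\xi_j\,{\rm sgn}(\boldsymbol{B}\cdot\boldsymbol{\xi})\mathcal{F}[\ldots]\big\rangle_{\tilde{D}}\frac{\partial^2\Omega_\mu[\boldsymbol{J}]}{\partial J_i\partial J_j}\Big\rangle_{\boldsymbol{\Omega};t} \qquad (74) $$

$$ G^{\rm bat}_{\mu\nu}[\boldsymbol{\Omega};t] = \frac{\eta^2}{N}\Big\langle\Big[\sum_i\big\langle\xi_i\,{\rm sgn}(\boldsymbol{B}\cdot\boldsymbol{\xi})\mathcal{F}[\ldots]\big\rangle_{\tilde{D}}\frac{\partial\Omega_\mu[\boldsymbol{J}]}{\partial J_i}\Big]\Big[\sum_j\big\langle\xi_j\,{\rm sgn}(\boldsymbol{B}\cdot\boldsymbol{\xi})\mathcal{F}[\ldots]\big\rangle_{\tilde{D}}\frac{\partial\Omega_\nu[\boldsymbol{J}]}{\partial J_j}\Big]\Big\rangle_{\boldsymbol{\Omega};t} \qquad (75) $$

The result (71) is still fairly general. The only conditions on the observables $\Omega_\mu[\boldsymbol{J}]$ needed for (71) to hold are (i) all are of order unity for $N\to\infty$, (ii) all are of the mean-field type (69), and (iii) one of them is the squared length of the student's weight vector.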



The Fokker-Planck equation (71) subsequently quantifies the properties of the ideal choice(s) for our macroscopic observables $\Omega_\mu[\boldsymbol{J}]$, if our aim is to find closed deterministic equations. Firstly:

$$ \text{deterministic laws:}\qquad \lim_{N\to\infty}G_{\mu\nu}[\boldsymbol{\Omega};t] = 0 \qquad (76) $$

If (76) holds, equation (71) reduces to a Liouville equation, with solutions of the desired form $P_t(\boldsymbol{\Omega}) = \delta[\boldsymbol{\Omega}-\boldsymbol{\Omega}^*(t)]$, in which the trajectory $\boldsymbol{\Omega}^*(t)$, in turn, is the solution of the deterministic equation

$$ \frac{d}{dt}\boldsymbol{\Omega} = \boldsymbol{F}[\boldsymbol{\Omega};t] \qquad (77) $$

with the flow field $\boldsymbol{F}$ given either by (72) (for on-line learning) or by (74) (for batch learning). Note that condition (76) is not only sufficient to guarantee deterministic evolution, but also necessary. Secondly, we want the deterministic laws to be closed:

$$ \text{closed laws:}\qquad \lim_{N\to\infty}\frac{\partial}{\partial t}F_\mu[\boldsymbol{\Omega};t] = 0 \qquad (78) $$

(again this condition is sufficient and necessary). A set of mean-field observables meeting the criteria (76,78) constitutes for $N\to\infty$ an exact autonomous macroscopic level of description of the learning process, in the form of the coupled deterministic differential equations (77). However, in general there will be no a priori guarantee that such a set of observables actually exists.

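4.3 Application to (Q, R) Evolution

We now apply the general results of this section to the specific duo of observables that we considered in the previous sections to describe on-line learning with complete training sets:

$$ \Omega_1[\boldsymbol{J}] = Q[\boldsymbol{J}] = \boldsymbol{J}^2 \qquad \Omega_2[\boldsymbol{J}] = R[\boldsymbol{J}] = \boldsymbol{J}\cdot\boldsymbol{B} \qquad (79) $$

These observables are indeed of the mean-field type (69) if all $B_i = O(N^{-\frac{1}{2}})$, and are defined to be of order unity. However, the training set $\tilde{D}$ is now chosen to consist of $|\tilde{D}| = \alpha N$ randomly drawn questions $\boldsymbol{\xi}^\mu\in\{-1,1\}^N$. We will show that $Q$ and $R$ obey deterministic macroscopic equations for any $\alpha$. These equations, however, fail to close as soon as the training set is incomplete (for $\alpha<\infty$). In contrast, our previous results are recovered for the case of complete training sets (for $\alpha\to\infty$). In addition we will derive for the case of complete training sets the macroscopic equations for the batch version of some of the most popular learning rules.

As could have been expected, we will also need the joint input distribution

$$ P(x,y) = \big\langle\big\langle\delta[x-\hat{\boldsymbol{J}}\cdot\boldsymbol{\xi}]\,\delta[y-\boldsymbol{B}\cdot\boldsymbol{\xi}]\big\rangle_{\tilde{D}}\big\rangle_{Q,R;t} \qquad (80) $$

with $\hat{\boldsymbol{J}} = \boldsymbol{J}/|\boldsymbol{J}|$. Note that we cannot simply assume the distribution (80) to be of a Gaussian form; it will depend on $\alpha$. We will now first show that the second order moments of $P(x,y)$ remain finite for any $\alpha$ in the limit $N\to\infty$. For arbitrary vectors $\boldsymbol{x}$ and $\boldsymbol{y}$ we find

$$ \big\langle(\boldsymbol{x}\cdot\boldsymbol{\xi})(\boldsymbol{y}\cdot\boldsymbol{\xi})\big\rangle_{\tilde{D}} = \boldsymbol{x}\cdot\boldsymbol{y}+L(\boldsymbol{x},\boldsymbol{y}) \qquad L(\boldsymbol{x},\boldsymbol{y}) = \sum_{i\neq j}x_iy_j\,\frac{1}{\alpha N}\sum_{\mu=1}^{\alpha N}\xi_i^\mu\xi_j^\mu $$

The second term is bounded according to $\lambda_{\rm min}|\boldsymbol{x}||\boldsymbol{y}| \leq L(\boldsymbol{x},\boldsymbol{y}) \leq \lambda_{\rm max}|\boldsymbol{x}||\boldsymbol{y}|$, in which the $\lambda$'s denote the (real) eigenvalues of the matrix $M_{ij} = \frac{1-\delta_{ij}}{\alpha N}\sum_{\mu=1}^{\alpha N}\xi_i^\mu\xi_j^\mu$. For large $N$ the spectrum of the matrix $M$ can be calculated using random matrix theory (see the references) and the eigenvalues will be bounded:

$$ \text{eigenvalues of } M_{ij} = \frac{1-\delta_{ij}}{\alpha N}\sum_{\mu=1}^{\alpha N}\xi_i^\mu\xi_j^\mu:\qquad \begin{array}{ll}\alpha<1: & \lambda_{\rm min} = -1,\quad \lambda_{\rm max} = \frac{1}{\alpha}+\frac{2}{\sqrt{\alpha}}\\[1mm] \alpha>1: & \lambda_{\rm min} = \frac{1}{\alpha}-\frac{2}{\sqrt{\alpha}},\quad \lambda_{\rm max} = \frac{1}{\alpha}+\frac{2}{\sqrt{\alpha}}\end{array} \qquad (81) $$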
From this it follows that all second order moments (and therefore also all first order moments) of the distribution $P(x,y)$ are finite, whatever the value of $\alpha$, but also that only for $\alpha\to\infty$ we recover the familiar previous expressions (derived for complete training sets in section 2) for the second-order moments in terms of $Q$ and $R$:

$$ \alpha\to\infty:\qquad \int\! dxdy\; x^2P(x,y) = \int\! dxdy\; y^2P(x,y) = 1, \qquad \int\! dxdy\; xy\,P(x,y) = R/\sqrt{Q} \qquad (82) $$
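
The boundedness of the spectrum of $M$ is easy to probe numerically. The sketch below is illustrative only: it assumes the matrix $M_{ij} = \frac{1-\delta_{ij}}{\alpha N}\sum_\mu \xi_i^\mu\xi_j^\mu$ with independent random $\pm 1$ entries, and compares its extreme eigenvalues for $\alpha > 1$ with the band edges $\frac{1}{\alpha}\pm\frac{2}{\sqrt{\alpha}}$ that standard random matrix theory (the Marchenko-Pastur law, shifted by the removed diagonal) predicts for $N\to\infty$. The function name and parameter values are hypothetical.

```python
import numpy as np

def offdiagonal_correlation_spectrum(N, alpha, seed=0):
    """Eigenvalues of M_ij = (1 - delta_ij) / (alpha N) * sum_mu xi_i^mu xi_j^mu."""
    rng = np.random.default_rng(seed)
    p = int(alpha * N)                           # number of questions in the training set
    xi = rng.integers(0, 2, size=(p, N)) * 2.0 - 1.0   # random +/-1 questions
    C = xi.T @ xi / p                            # diagonal entries are exactly 1
    return np.linalg.eigvalsh(C - np.eye(N))     # remove the diagonal, then diagonalise

alpha = 4.0
eigenvalues = offdiagonal_correlation_spectrum(N=400, alpha=alpha)
lower_edge = 1.0 / alpha - 2.0 / np.sqrt(alpha)  # assumed N -> infinity band edge
upper_edge = 1.0 / alpha + 2.0 / np.sqrt(alpha)  # assumed N -> infinity band edge
print(eigenvalues.min(), lower_edge, eigenvalues.max(), upper_edge)
```

As $\alpha$ is increased both band edges move towards zero, so that $L(\boldsymbol{x},\boldsymbol{y})$ vanishes and only the term $\boldsymbol{x}\cdot\boldsymbol{y}$ survives, in line with the statement that the familiar moments are recovered only for $\alpha\to\infty$.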



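The next stage is to assess the scaling of the various diffusion terms $G_{\mu\nu}$ in the Fokker-Planck equation (71). These should vanish for $N\to\infty$ if our observables are to behave deterministically in the limit $N\to\infty$. For the present observables the diffusion terms (73,75) become

$$ G^{\rm onl}_{QQ}[Q,R;t] = \frac{4\eta^2}{N}\,Q\int\! dxdy\; P(x,y)\,x^2\mathcal{F}^2[Q^{\frac{1}{2}};Q^{\frac{1}{2}}x,y] $$

$$ G^{\rm onl}_{QR}[Q,R;t] = \frac{2\eta^2}{N}\,Q^{\frac{1}{2}}\int\! dxdy\; P(x,y)\,xy\,\mathcal{F}^2[Q^{\frac{1}{2}};Q^{\frac{1}{2}}x,y] $$

$$ G^{\rm onl}_{RR}[Q,R;t] = \frac{\eta^2}{N}\int\! dxdy\; P(x,y)\,y^2\mathcal{F}^2[Q^{\frac{1}{2}};Q^{\frac{1}{2}}x,y] $$

$$ G^{\rm bat}_{QQ}[Q,R;t] = \frac{4\eta^2}{N}\,Q\Big[\int\! dxdy\; P(x,y)\,x\,{\rm sgn}(y)\mathcal{F}[Q^{\frac{1}{2}};Q^{\frac{1}{2}}x,y]\Big]^2 $$

$$ G^{\rm bat}_{QR}[Q,R;t] = \frac{2\eta^2}{N}\,Q^{\frac{1}{2}}\Big[\int\! dxdy\; P(x,y)\,x\,{\rm sgn}(y)\mathcal{F}[Q^{\frac{1}{2}};Q^{\frac{1}{2}}x,y]\Big]\Big[\int\! dxdy\; P(x,y)\,|y|\mathcal{F}[Q^{\frac{1}{2}};Q^{\frac{1}{2}}x,y]\Big] $$

$$ G^{\rm bat}_{RR}[Q,R;t] = \frac{\eta^2}{N}\Big[\int\! dxdy\; P(x,y)\,|y|\mathcal{F}[Q^{\frac{1}{2}};Q^{\frac{1}{2}}x,y]\Big]^2 $$

We conclude that all diffusion terms $G_{\mu\nu}$ are of order $O(\frac{1}{N})$, provided $\mathcal{F}[\ldots]$ is bounded (which we assumed from the start). This implies that for $N\to\infty$ our macroscopic observables $Q$ and $R$ indeed evolve deterministically for any $\alpha>0$.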
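
The $O(1/N)$ scaling of the diffusion terms can be made visible in a small simulation. The sketch below is a rough illustration under several assumptions that are not taken from the text: on-line Hebbian learning ($\mathcal{F}=1$) with a complete training set (a fresh random question at every step), a random normalised teacher, a random initial student with $Q(0)=1$, and a modest number of independent runs. Function and parameter names are hypothetical. It measures the spread of $Q = \boldsymbol{J}^2$ at a fixed time $t$ for two system sizes.

```python
import numpy as np

def simulate_Q(N, eta=1.0, t=2.0, runs=60, seed=0):
    """Return Q = J.J at time t for `runs` independent on-line Hebbian runs."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal(N)
    B /= np.linalg.norm(B)                       # teacher vector, |B| = 1
    Q_values = np.empty(runs)
    for r in range(runs):
        J = rng.standard_normal(N)
        J /= np.linalg.norm(J)                   # student starts with Q(0) = 1
        for _ in range(int(N * t)):              # N*t iteration steps <-> time t
            xi = rng.integers(0, 2, size=N) * 2.0 - 1.0   # random question in {-1,1}^N
            J += (eta / N) * xi * np.sign(B @ xi)         # Hebbian update, F = 1
        Q_values[r] = J @ J
    return Q_values

Q_small = simulate_Q(N=25)
Q_large = simulate_Q(N=400)
# A variance of order 1/N means the standard deviation should shrink roughly like N**-0.5.
print(Q_small.std(), Q_large.std())
```

With a sixteen-fold increase in $N$ one expects the standard deviation of $Q$ to drop by roughly a factor of four, up to sampling noise from the finite number of runs.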

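The resulting deterministic equations for the duo $(Q,R)$ for on-line and batch learning are given by combining (77) with the flow terms (72) and (74), respectively. These equations we now work out explicitly, starting with the on-line scenario. Insertion of (72) into (77) gives

$$ \frac{d}{dt}Q = \lim_{N\to\infty}\Big\{2\eta Q^{\frac{1}{2}}\int\! dxdy\; P(x,y)\,x\,{\rm sgn}(y)\mathcal{F}[Q^{\frac{1}{2}};Q^{\frac{1}{2}}x,y]+\eta^2\int\! dxdy\; P(x,y)\,\mathcal{F}^2[Q^{\frac{1}{2}};Q^{\frac{1}{2}}x,y]\Big\} \qquad (83) $$

$$ \frac{d}{dt}R = \lim_{N\to\infty}\eta\int\! dxdy\; P(x,y)\,|y|\,\mathcal{F}[Q^{\frac{1}{2}};Q^{\frac{1}{2}}x,y] \qquad (84) $$

Note that these equations are of the same form as those derived earlier for complete training sets, i.e. (9,10). The differences between complete and incomplete training sets are purely in the joint distribution $P(x,y)$, i.e. equation (80).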



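Working out the macroscopic equations for the case of batch learning is somewhat less straightforward, although the final result will be simpler. Insertion of (74) into (77) gives, with the usual short-hand $\mathcal{F}[\ldots] = \mathcal{F}[|\boldsymbol{J}|;\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}]$:

$$ \frac{d}{dt}Q = \lim_{N\to\infty}\Big\{2\eta Q^{\frac{1}{2}}\int\! dxdy\; P(x,y)\,x\,{\rm sgn}(y)\mathcal{F}[Q^{\frac{1}{2}};Q^{\frac{1}{2}}x,y]+\frac{\eta^2}{N}\Big\langle\big\langle\boldsymbol{\xi}\,{\rm sgn}(\boldsymbol{B}\cdot\boldsymbol{\xi})\mathcal{F}[\ldots]\big\rangle_{\tilde{D}}^2\Big\rangle_{Q,R;t}\Big\} $$

$$ \frac{d}{dt}R = \lim_{N\to\infty}\eta\int\! dxdy\; P(x,y)\,|y|\,\mathcal{F}[Q^{\frac{1}{2}};Q^{\frac{1}{2}}x,y] $$

The second term in the temporal derivative of $Q$ can be written as the subshell average of a quantity of the form

$$ \frac{1}{N}\Big[\frac{1}{\alpha N}\sum_{\mu=1}^{\alpha N}\boldsymbol{\xi}^\mu x_\mu\Big]^2 = \frac{1}{\alpha N}\Big[\frac{\boldsymbol{x}^2}{\alpha N}+K(\boldsymbol{x})\Big] \qquad K(\boldsymbol{x}) = \frac{1}{\alpha N}\sum_{\mu\neq\nu}x_\mu\,\frac{\boldsymbol{\xi}^\mu\cdot\boldsymbol{\xi}^\nu}{N}\,x_\nu $$

with $x_\mu = {\rm sgn}(\boldsymbol{B}\cdot\boldsymbol{\xi}^\mu)\mathcal{F}[|\boldsymbol{J}|;\boldsymbol{J}\cdot\boldsymbol{\xi}^\mu,\boldsymbol{B}\cdot\boldsymbol{\xi}^\mu]$ and $\boldsymbol{x}^2/\alpha N = O(1)$ for $N\to\infty$. The second term in this expression is bounded according to $\lambda_{\rm min}\boldsymbol{x}^2/\alpha N \leq K(\boldsymbol{x}) \leq \lambda_{\rm max}\boldsymbol{x}^2/\alpha N$, in which the $\lambda$'s denote the (real) eigenvalues of the matrix $\tilde{M}_{\mu\nu} = \frac{1-\delta_{\mu\nu}}{N}\sum_{i=1}^{N}\xi_i^\mu\xi_i^\nu$. Note that for $N\to\infty$ the eigenvalues of the matrix $\tilde{M}$ are related to those of the matrix $M$ in (81) by simply replacing $\alpha\to 1/\alpha$ (since the relation between the two cases is interchanging $\alpha N$ and $N$). From this it follows that $\lim_{N\to\infty}\frac{1}{\alpha N}\big[\frac{\boldsymbol{x}^2}{\alpha N}+K(\boldsymbol{x})\big] = 0$, and that in the temporal derivative of $Q$ only the first term survives the limit $N\to\infty$. This leaves the final result:

$$ \frac{d}{dt}Q = \lim_{N\to\infty}2\eta Q^{\frac{1}{2}}\int\! dxdy\; P(x,y)\,x\,{\rm sgn}(y)\mathcal{F}[Q^{\frac{1}{2}};Q^{\frac{1}{2}}x,y] \qquad (85) $$

$$ \frac{d}{dt}R = \lim_{N\to\infty}\eta\int\! dxdy\; P(x,y)\,|y|\,\mathcal{F}[Q^{\frac{1}{2}};Q^{\frac{1}{2}}x,y] \qquad (86) $$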



For any value of $\alpha$, the difference between the macroscopic equations for on-line learning (83,84) and batch learning (85,86) (apart from a possible difference in the expressions one might find for the distribution $P(x,y)$) is simply the presence/absence of terms which are quadratic in the learning rate $\eta$.

For finite $\alpha$, the case of incomplete training sets, we observe that the macroscopic equations for the pair $(Q,R)$ (i.e. (83,84) and (85,86)) do not close, since the distribution $P(x,y)$ (80) need not be of a Gaussian form, and its moments need not (and almost certainly will not) be expressible in terms of the quantities $Q$ and $R$.

For $\alpha\to\infty$, the case of complete training sets, we can express the second order moments of $P(x,y)$ (80) in terms of the observables $(Q,R)$ via (82). Moreover, we can show that the first order moments of $P(x,y)$ are zero, since for any normalised vector $\boldsymbol{x}\in\Re^N$:




< 




7(0 



in which 7(^) obeys (with brackets denoting averages over the possible training sets): 



(7^(0) 



1 



aN 



N 

a^N^ ^ 

ij=l fj,upX=l 



El u u p \\ 



This shows that $\lim_{\alpha\to\infty}\gamma(\boldsymbol{\xi}) = 0$ and that the first order moments of $P(x,y)$ will be zero. What cannot be demonstrated rigorously, however, is that for $\alpha\to\infty$ the distribution $P(x,y)$ is of a Gaussian form. This is impossible in principle, even for $\alpha\to\infty$. We could, for instance, choose an initial state $\boldsymbol{J}(0)$ for the student weight vector of the form $J_i(0)\sim e^{-i}$, in which case the Gaussian assumption would be violated for short times. If we choose our teacher vector of the form $B_i\sim e^{-i}$ the situation is even worse: now the system will be forced to evolve into a macroscopic state with a non-Gaussian distribution $P(x,y)$. It will be clear that all we can hope for is that for non-pathological initial conditions $\boldsymbol{J}(0)$ and non-pathological teacher vectors $\boldsymbol{B}$ one can derive a dynamic equation for $P(x,y)$ with Gaussian solutions.



4.4 Complete Training Sets: Batch Learning versus On-Line Learning 

Here we will work out the macroscopic equations for the batch versions of the Hebbian, perceptron and AdaTron learning rules, and compare the results to those of the on-line scenarios. It turns out that for these cases one can solve the macroscopic dynamical laws explicitly. We restrict ourselves to complete training sets. For $\alpha\to\infty$ and $N\to\infty$ the (exact) results of the previous subsection can be written as

$$ \frac{d}{dt}Q = 2\eta Q^{\frac{1}{2}}\int\! dxdy\; P(x,y)\,x\,{\rm sgn}(y)\mathcal{F}[Q^{\frac{1}{2}};Q^{\frac{1}{2}}x,y]+\Delta\eta^2\int\! dxdy\; P(x,y)\,\mathcal{F}^2[Q^{\frac{1}{2}};Q^{\frac{1}{2}}x,y] \qquad (87) $$

$$ \frac{d}{dt}R = \eta\int\! dxdy\; P(x,y)\,|y|\,\mathcal{F}[Q^{\frac{1}{2}};Q^{\frac{1}{2}}x,y] \qquad (88) $$

in which $\Delta = 1$ for the on-line scenario and $\Delta = 0$ for the batch scenario. Of the distribution $P(x,y)$ we know, without additional assumptions:

$$ P(x,y) = \lim_{N\to\infty}\big\langle\big\langle\delta[x-\hat{\boldsymbol{J}}\cdot\boldsymbol{\xi}]\,\delta[y-\boldsymbol{B}\cdot\boldsymbol{\xi}]\big\rangle_{\tilde{D}}\big\rangle_{Q,R;t} \qquad \langle x\rangle = \langle y\rangle = 0,\quad \langle x^2\rangle = \langle y^2\rangle = 1,\quad \langle xy\rangle = R/\sqrt{Q} $$

If we now assume $\boldsymbol{J}(0)$ and $\boldsymbol{B}$ to be such that $P(x,y)$ has a Gaussian shape, the above expressions for the moments immediately dictate that for both scenarios $P(x,y)$ will be identical to (14). We now firstly recover our previous macroscopic equations (9,10) for the case of on-line learning ($\Delta = 1$), and secondly find that the macroscopic equations for the case of batch learning can, for any choice $\mathcal{F}[\ldots]$ of the details of the learning rule, be obtained from the on-line equations by simply removing from the latter all terms which are quadratic in the learning rate $\eta$. This also holds if we write the macroscopic equations in terms of the observables $(E,J)$, since the transformation $(Q,R)\to(E,J)$ does not involve the learning rate $\eta$.

For the Hebbian rule $\mathcal{F}[|\boldsymbol{J}|;\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}] = 1$ we obtain the macroscopic equations describing batch learning by elimination of the $\eta^2$ terms from the on-line equations (17,18), giving

$$ \frac{d}{dt}J = \eta\cos(\pi E)\sqrt{\frac{2}{\pi}} \qquad \frac{d}{dt}E = -\frac{\eta\sin(\pi E)}{\pi J}\sqrt{\frac{2}{\pi}} \qquad (89) $$

We can solve these equations by exploiting the existence of a conserved quantity. If we define $D(J,E) = J\sin(\pi E)$ we find, using (89), that $\frac{d}{dt}D = 0$, which allows us to express the length $J(t)$ at any time as

$$ J = J_0\,\frac{\sin(\pi E_0)}{\sin(\pi E)} $$

Substitution into the differential equation for the generalization error $E$ then leads to a single non-linear differential equation involving $E$ only:

$$ \frac{d}{dt}E = -\sqrt{\frac{2}{\pi}}\;\frac{\eta\sin^2(\pi E)}{\pi J_0\sin(\pi E_0)} $$

This equation is easily solved:

$$ t = \frac{J_0\sin(\pi E_0)}{\eta}\sqrt{\frac{\pi}{2}}\left[\frac{1}{\tan(\pi E)}-\frac{1}{\tan(\pi E_0)}\right] \qquad (90) $$

Asymptotically this gives

$$ E \sim \frac{J_0\sin(\pi E_0)}{\eta\sqrt{2\pi}}\;t^{-1} $$

Asymptotically the gain in using the batch scenario rather than the on-line scenario is having a power law error relaxation of the form $t^{-1}$ rather than $t^{-\frac{1}{2}}$.

For the perceptron rule $\mathcal{F}[|\boldsymbol{J}|;\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}] = \theta[-(\boldsymbol{J}\cdot\boldsymbol{\xi})(\boldsymbol{B}\cdot\boldsymbol{\xi})]$ we obtain the macroscopic equations describing batch learning by elimination of the $\eta^2$ terms from the corresponding on-line equations, giving

$$ \frac{d}{dt}J = -\frac{\eta[1-\cos(\pi E)]}{\sqrt{2\pi}} \qquad \frac{d}{dt}E = -\frac{\eta\sin(\pi E)}{\pi\sqrt{2\pi}\,J} \qquad (91) $$

Here we find that the quantity $D(J,E) = J[1+\cos(\pi E)]$ is conserved, which leads to

$$ J = J_0\,\frac{1+\cos(\pi E_0)}{1+\cos(\pi E)} $$

Substitution into the differential equation for the generalization error $E$ then again leads to a single non-linear differential equation involving $E$ only:

$$ \frac{d}{dt}E = -\frac{\eta\sin(\pi E)[1+\cos(\pi E)]}{\pi\sqrt{2\pi}\,J_0[1+\cos(\pi E_0)]} $$

which can be solved by writing $t$ as an integral over $E$ and by using

$$ \int\!\frac{dx}{\sin(x)[1+\cos(x)]} = \frac{1}{2}\log\tan\Big(\frac{x}{2}\Big)+\frac{1}{2[1+\cos(x)]} $$

(see the references). This results in

$$ t = \frac{J_0}{\eta}\sqrt{\frac{\pi}{2}}\,[1+\cos(\pi E_0)]\left[\log\tan\Big(\frac{\pi E_0}{2}\Big)+\frac{1}{1+\cos(\pi E_0)}-\log\tan\Big(\frac{\pi E}{2}\Big)-\frac{1}{1+\cos(\pi E)}\right] \qquad (92) $$

Asymptotically we now find an exponential decay of the generalization error:

$$ E \sim \exp\left[-\sqrt{\frac{2}{\pi}}\;\frac{\eta\,t}{J_0[1+\cos(\pi E_0)]}\right] $$

The gain in using the batch scenario rather than the on-line scenario for the perceptron learning rule is quite significant. The batch scenario gives an exponentially fast decay of the generalization error, compared to a power law relaxation of the form $t^{-\frac{1}{3}}$ for on-line learning.
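
The exponential decay can be read off numerically from the implicit solution (92). The sketch below is only an illustration: it assumes (92) in the form $t = \frac{J_0}{\eta}\sqrt{\pi/2}\,[1+\cos(\pi E_0)]\,[g(E_0)-g(E)]$ with $g(E) = \log\tan(\pi E/2)+1/[1+\cos(\pi E)]$, and the decay rate $\sqrt{2/\pi}\,\eta/(J_0[1+\cos(\pi E_0)])$; the parameter values and function names are hypothetical.

```python
import math

def time_to_reach(E, E0, J0=1.0, eta=1.0):
    """Assumed implicit solution t(E) for the batch perceptron rule."""
    g = lambda e: math.log(math.tan(0.5 * math.pi * e)) + 1.0 / (1.0 + math.cos(math.pi * e))
    prefactor = (J0 / eta) * math.sqrt(math.pi / 2.0) * (1.0 + math.cos(math.pi * E0))
    return prefactor * (g(E0) - g(E))

def predicted_decay_rate(E0, J0=1.0, eta=1.0):
    """Assumed asymptotic rate r in E ~ exp(-r t)."""
    return eta * math.sqrt(2.0 / math.pi) / (J0 * (1.0 + math.cos(math.pi * E0)))

def dE_dt(E, E0, J0=1.0, eta=1.0):
    """Assumed closed error equation for the batch perceptron rule."""
    return -eta * math.sin(math.pi * E) * (1.0 + math.cos(math.pi * E)) \
        / (math.pi * math.sqrt(2.0 * math.pi) * J0 * (1.0 + math.cos(math.pi * E0)))

E0 = 0.4
# Slope of log(E) versus t between two very small errors.
E_a, E_b = 1e-6, 1e-7
measured_rate = (math.log(E_a) - math.log(E_b)) / (time_to_reach(E_b, E0) - time_to_reach(E_a, E0))

# Consistency of t(E) with the error equation: dt/dE should equal 1 / (dE/dt).
E_mid, h = 0.2, 1e-6
dt_dE = (time_to_reach(E_mid + h, E0) - time_to_reach(E_mid - h, E0)) / (2.0 * h)
print(measured_rate, predicted_decay_rate(E0), dt_dE * dE_dt(E_mid, E0))
```

Note that the decay rate depends on the initial state through $J_0[1+\cos(\pi E_0)]$, i.e. through the conserved quantity.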



Finally we turn to the AdaTron rule $\mathcal{F}[|\boldsymbol{J}|;\boldsymbol{J}\cdot\boldsymbol{\xi},\boldsymbol{B}\cdot\boldsymbol{\xi}] = |\boldsymbol{J}\cdot\boldsymbol{\xi}|\,\theta[-(\boldsymbol{J}\cdot\boldsymbol{\xi})(\boldsymbol{B}\cdot\boldsymbol{\xi})]$. Here we obtain the macroscopic equations describing batch learning by elimination of the $\eta^2$ terms from the on-line equations (28,29), giving

$$ \frac{d}{dt}J = -\eta JE+\frac{\eta J}{\pi}\cos(\pi E)\sin(\pi E) \qquad \frac{d}{dt}E = -\frac{\eta\sin^2(\pi E)}{\pi^2} $$

Figure 13: Qualitative comparison of the evolution of the error (as a function of time $t$) for batch versus on-line learning rules with constant learning rates $\eta = 1$. Solid lines: Hebbian rule (upper solid: on-line learning, lower solid: batch learning). Dashed lines: Perceptron rule (upper dashed: on-line learning, lower dashed: batch learning). Dotted lines: AdaTron rule (upper dotted: on-line learning, lower dotted: batch learning).

The equation for the generalization error is already decoupled from the equation giving the evolution of the length $J$, and can be solved directly:

$$ t = \frac{\pi}{\eta}\left[\frac{1}{\tan(\pi E)}-\frac{1}{\tan(\pi E_0)}\right] \qquad (93) $$

Asymptotically this behaves as

$$ E \sim \frac{1}{\eta t} $$

For the AdaTron rule there is only little to be gained in switching from on-line learning to batch learning. Both scenarios give a power law error relaxation of the form $t^{-1}$ (albeit with different prefactors).
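
Since (93) can be inverted explicitly, the AdaTron batch result lends itself to a direct check. The sketch below assumes the solution in the form $\cot(\pi E) = \cot(\pi E_0)+\eta t/\pi$ and the error equation $\frac{d}{dt}E = -\eta\sin^2(\pi E)/\pi^2$; the function name and the parameter values are illustrative choices, not taken from the text.

```python
import math

def adatron_batch_error(t, E0, eta=1.0):
    """E(t) obtained by inverting cot(pi E) = cot(pi E0) + eta t / pi (assumed form)."""
    cot = 1.0 / math.tan(math.pi * E0) + eta * t / math.pi
    return math.atan2(1.0, cot) / math.pi        # angle in (0, pi) with the required cotangent

E0, eta = 0.4, 1.0

# Check the differential equation at an intermediate time via central differences.
t_mid, h = 3.0, 1e-5
numerical_slope = (adatron_batch_error(t_mid + h, E0, eta)
                   - adatron_batch_error(t_mid - h, E0, eta)) / (2.0 * h)
E_mid = adatron_batch_error(t_mid, E0, eta)
predicted_slope = -eta * math.sin(math.pi * E_mid) ** 2 / math.pi ** 2

# Check the asymptotic power law E ~ 1 / (eta t).
t_late = 1e6
late_product = eta * t_late * adatron_batch_error(t_late, E0, eta)
print(numerical_slope, predicted_slope, late_product)
```

The product $\eta t E$ tends to one at large times, which is the $t^{-1}$ relaxation quoted above.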

We summarise the results of this section on batch learning with complete training sets in the table below, and also illustrate the differences between the batch results and the on-line results in figure 13. Whereas the error evolution for the batch versions of the Hebbian and AdaTron rules is almost identical, there is clearly a remarkable difference between the perceptron learning rule on the one hand and the Hebbian and AdaTron rules on the other, in the degree to which they benefit from being executed in a batch scenario rather than an on-line scenario. Only the perceptron rule manages to significantly capitalise on the advantage of batch learning (where all question/answer pairs in the training set $\tilde{D}$ are available at each iteration step, rather than just a single question/answer pair) and realise an exponential decay of the generalization error.
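GENERALIZATION ERROR IN PERCEPTRONS WITH BATCH LEARNING RULES

| Rule | Generalization error | Asymptotics |
|------|------|------|
| Hebbian | $t = \frac{J_0\sin(\pi E_0)}{\eta}\sqrt{\frac{\pi}{2}}\left[\frac{1}{\tan(\pi E)}-\frac{1}{\tan(\pi E_0)}\right]$ | $E \sim \frac{J_0\sin(\pi E_0)}{\eta\sqrt{2\pi}}\,t^{-1}$ |
| Perceptron | $t = \frac{J_0}{\eta}\sqrt{\frac{\pi}{2}}\,(1+\cos(\pi E_0))\left[\ln\tan(\frac{\pi E_0}{2})+\frac{1}{1+\cos(\pi E_0)}-\ln\tan(\frac{\pi E}{2})-\frac{1}{1+\cos(\pi E)}\right]$ | $E \sim \exp\left[-\sqrt{\frac{2}{\pi}}\,\frac{\eta t}{J_0[1+\cos(\pi E_0)]}\right]$ |
| AdaTron | $t = \frac{\pi}{\eta}\left[\frac{1}{\tan(\pi E)}-\frac{1}{\tan(\pi E_0)}\right]$ | $E \sim \frac{1}{\eta t}$ |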
5 Incomplete Training Sets



5.1 The Problem and Our Options

We have seen in the previous section that in the case of incomplete training sets (where $|\tilde{D}| = \alpha N$) the equations for our familiar observables $Q[\boldsymbol{J}] = \boldsymbol{J}^2$ and $R[\boldsymbol{J}] = \boldsymbol{J}\cdot\boldsymbol{B}$ (or, equivalently, for $|\boldsymbol{J}|$ and the generalization error $E_g[\boldsymbol{J}] = \frac{1}{\pi}\arccos(R[\boldsymbol{J}]/\sqrt{Q[\boldsymbol{J}]})$) no longer close, since the distribution $P(x,y)$ (80) will no longer be Gaussian and cannot be written in such a way that its dependence on the weight vector $\boldsymbol{J}$ is only through the observables $Q[\boldsymbol{J}]$ and $R[\boldsymbol{J}]$. One can in fact show that for $\alpha<\infty$ no finite set of observables will ever obey a closed set of dynamic equations.

Closely related to this problem is the fact that our macroscopic equations always involve averages over the training set $\tilde{D}$, which for $\alpha<\infty$ will generally depend on the details of the choice made for the $\alpha N$ questions in $\tilde{D}$. Since we cannot expect to be able to solve the dynamics for any given microscopic realisation $\{\boldsymbol{\xi}^1,\ldots,\boldsymbol{\xi}^{\alpha N}\}$ of the set $\tilde{D}$, we will be forced to restrict ourselves to calculating averages of observables over all possible realisations of the training set. In order to avoid thereby ending up with irrelevant statements (since we really aim to arrive at predictions for actual simulation experiments, rather than averages over many such predictions), it is of vital importance to focus on those observables which in the limit $N\to\infty$ tend towards their averages over all possible training sets anyway. Numerical simulations show that macroscopic observables such as the generalisation and training errors have this property: if e.g. one chooses the questions in the training set $\tilde{D}$ at random from $D = \{-1,1\}^N$, one will simply observe that for large $N$ the curves for $E_g$ and $E_t$ as functions of time are reproducible, and depend only on the relative size $\alpha$ of $\tilde{D}$, not on its detailed composition:

$$ \lim_{N\to\infty}\langle E_t\rangle = \lim_{N\to\infty}\langle\langle E_t\rangle\rangle_{\rm sets} = \lim_{N\to\infty}\Big\langle\int\! d\boldsymbol{J}\;p_t(\boldsymbol{J}|\boldsymbol{\xi}^1,\ldots,\boldsymbol{\xi}^{\alpha N})\,\frac{1}{\alpha N}\sum_{\mu=1}^{\alpha N}\theta[-(\boldsymbol{J}\cdot\boldsymbol{\xi}^\mu)(\boldsymbol{B}\cdot\boldsymbol{\xi}^\mu)]\Big\rangle_{\rm sets} \qquad (94) $$

$$ \lim_{N\to\infty}\langle E_g\rangle = \lim_{N\to\infty}\langle\langle E_g\rangle\rangle_{\rm sets} = \lim_{N\to\infty}\Big\langle\int\! d\boldsymbol{J}\;p_t(\boldsymbol{J}|\boldsymbol{\xi}^1,\ldots,\boldsymbol{\xi}^{\alpha N})\,\big\langle\theta[-(\boldsymbol{J}\cdot\boldsymbol{\xi})(\boldsymbol{B}\cdot\boldsymbol{\xi})]\big\rangle_{D}\Big\rangle_{\rm sets} \qquad (95) $$

(with the microscopic probability density $p_t(\boldsymbol{J}|\tilde{D})$ for the student weight vector, given a realisation of the training set $\tilde{D}$).

In equilibrium calculations the problem is often less severe, since in many cases one at least knows the stationary microscopic probability density $p_\infty(\boldsymbol{J}|\tilde{D})$, so that one can write down the (exact) expressions for the equilibrium expectation values of the training and generalization errors and their averages over the realisations of the training set (94,95). One can then work out these expressions and obtain transparent results in the $N\to\infty$ limit upon exchanging the order of the various summations and integrations. The remaining problem is of a technical nature. In dynamical studies away from equilibrium, on the other hand, we usually do not have an expression for $p_t(\boldsymbol{J}|\tilde{D})$ at our disposal, and our problem is of a conceptual rather than a technical nature. In order to proceed we need to average over the realisations of the training sets, but we have as yet no object to average ...

The toolbox of non-equilibrium statistical mechanics at present offers two (in a way complementary) techniques to deal with this situation, which is a familiar one in the field of disordered magnetic systems, namely the technique of generating functionals (involving path integrals) and dynamical replica theory. Following the generating functional route one performs the average over the realisations of the training sets on an object from which one can derive all relevant observables by differentiation. In the limit $N\to\infty$ this procedure leads to exact equations for two-time correlation and response functions, which, however, are highly complicated and can be solved in practice only near equilibrium. In dynamical replica theory one derives deterministic macroscopic equations for an observable function (equivalent to an infinite number of ordinary scalar observables), which are averaged over the realisations of the training set using the so-called replica method. Here one assumes that the chosen function obeys closed deterministic equations in the $N\to\infty$ limit; the exactness of the resulting theory depends on the degree to which this assumption is correct. Solving the resulting equations numerically is feasible for transients, but as yet too CPU-intensive to allow for solution close to equilibrium.

Footnote: The property that observables tend towards their averages over all possible training sets for $N\to\infty$ is called 'self-averaging'.



5.2 Route 1: Generating Functionals and Path Integrals 

This rather elegant approach, which to our knowledge has so far only been applied to learning rules with binary weights, is based on calculating a generating functional $Z[\boldsymbol{\psi}]$, which is an average over all possible 'paths' $\{\boldsymbol{J}(t)\}$ ($t\geq 0$) of the student's weight vector through the state space, given the dynamics (64):

$$ Z[\boldsymbol{\psi}] = \Big\langle e^{-i\sum_i\int_0^\infty\! dt\;\psi_i(t)J_i(t)}\Big\rangle \qquad (96) $$

in which time is a continuous variable. As with all path integrals, averages such as (96) are understood to be defined in the following way: (i) one discretises time in the dynamic equation (64), (ii) one calculates the desired average, and subsequently (iii) one takes the continuum limit in the resulting expression. From (96) one can calculate all relevant single- and multiple-time observables by functional differentiation. Averaging the generating functional over the possible realisations of the training set $\tilde{D}$ gives relations such as

$$ \langle J_i(t)\rangle_{\rm sets} = i\lim_{\boldsymbol{\psi}\to 0}\frac{\delta}{\delta\psi_i(t)}\langle Z[\boldsymbol{\psi}]\rangle_{\rm sets} \qquad (97) $$

etc. Overall constant prefactors in $Z[\boldsymbol{\psi}]$ can always be recovered a posteriori with the identity $Z[0] = 1$.

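The discretised version of our dynamic equation and the corresponding discretised expression
for the generating functional (96), with time-steps of duration $\Delta$, would be

$$ p_{t+\Delta}(J) = \int dJ' \Big\{ \delta[J - J'] + \Delta N \big[ W[J;J'] - \delta[J - J'] \big] \Big\} p_t(J') \qquad (0 < \Delta \ll 1) \qquad (99) $$

$$ Z[\psi] = \Big\langle e^{-i \Delta \sum_i \sum_{\ell=0}^{t/\Delta} \psi_i(\ell\Delta) J_i(\ell\Delta)} \Big\rangle \qquad (100) $$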

At the end of our calculation the dependence of any physical observable on $\Delta$, other than via
$t = \ell\Delta$, ought to disappear. Note that, although (99) appears to be almost identical to (59)
(equation (59) can be obtained from (99) by choosing $\Delta = N^{-1}$), there is a crucial technical
difference. In (99), in contrast to (59), we can control the parameter that converts time into
a continuous variable ($\Delta$) independently of the parameter that controls the fluctuations ($N$).
This allows us to take the limit $N \to \infty$ before the limit $\Delta \to 0$. The discretised process (99)
gives for the probability density $P[J(t_0), \ldots, J(t_\ell)]$ of a temporally discretised path (with
$t_n = n\Delta$):

$$ P[J(t_0), \ldots, J(t_\ell)] = \prod_{n=0}^{\ell-1} \Big\{ \delta[J(t_{n+1}) - J(t_n)] + \Delta N \big[ W[J(t_{n+1}); J(t_n)] - \delta[J(t_{n+1}) - J(t_n)] \big] \Big\} $$

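so that we find for (100) after averaging over all possible training sets $D$:

$$ \langle Z[\psi] \rangle_{\rm sets} = \int \cdots \int \prod_{n=0}^{t/\Delta} \Big[ dJ(t_n)\, e^{-i\Delta \sum_i \psi_i(t_n) J_i(t_n)} \Big] \Big\langle \prod_{n=0}^{t/\Delta - 1} \Big\{ \delta[J(t_{n+1}) - J(t_n)] + \Delta N \big[ W[J(t_{n+1}); J(t_n)] - \delta[J(t_{n+1}) - J(t_n)] \big] \Big\} \Big\rangle_{\rm sets} \qquad (101) $$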
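
The structure of the discretised process can be made concrete on a finite state space. The following is a minimal sketch under stated assumptions: the continuous weight vector is replaced by a small set of discrete states, `W[a][b]` is a hypothetical column-stochastic jump matrix, and the names `master_step` and `evolve` are introduced here purely for illustration. It shows that the choice $\Delta = 1/N$ reduces one step to a plain application of $W$, while smaller $\Delta$ at fixed $N$ approaches the continuous-time process.

```python
def master_step(p, W, delta, N):
    """One step of p_{t+delta} = p_t + delta*N*(W p_t - p_t), a finite-state analogue of (99).

    W[a][b] is the probability of a jump from state b to state a (columns sum to 1).
    """
    size = len(p)
    new_p = []
    for a in range(size):
        inflow = sum(W[a][b] * p[b] for b in range(size))
        new_p.append(p[a] + delta * N * (inflow - p[a]))
    return new_p


def evolve(p0, W, delta, N, t):
    """Iterate master_step for round(t / delta) steps, i.e. up to time t."""
    p = list(p0)
    for _ in range(int(round(t / delta))):
        p = master_step(p, W, delta, N)
    return p
```

Because $\Delta$ and $N$ enter separately, one can refine the time grid (halving `delta`) without touching the parameter `N` that sets the jump rate, which is the freedom exploited in the text.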



The problem has hereby again turned into a technical one, albeit of a highly non-trivial
nature. The strategy would now be to (i) insert into (101) the recipe for the learning
rule to be studied, (ii) introduce appropriate $\delta$-distributions that will isolate all occurrences
of the vectors $\xi \in D$ in (101) in such a way that the average over all training sets can be
performed, (iii) take the limit $N \to \infty$ for finite $\Delta$ (this will lead to a saddle-point integral,
involving integration variables with two time-arguments), (iv) take the limit $\Delta \to 0$, which
restores the original dynamics and converts all integrals into path integrals, and finally (v)
solve the saddle-point equations.

The saddle-point equations will describe a non-Markovian stochastic dynamical problem 
for an effective single weight variable; it will involve a retarded self-interaction and a stochastic 
noise which is not local in time (i.e. with an auto-correlation function of finite width). This 
causes these saddle-point equations to be extremely hard to solve, especially in the transient 
stages of the learning dynamics. Here we will not follow this procedure further, mainly 
because for the types of rules we have been considering in this review such calculations have 
not yet been performed (this program has so far only been carried out for learning rules 
involving binary weight vectors $J \in \{-1,1\}^N$).

5.3 Route 2: Dynamical Replica Theory 

The second procedure to deal with incomplete training sets is closer to the methods used
so far for dealing with complete training sets than the above formalism, since it involves
macroscopic differential equations for single-time observables. The ground work has already
been done in section four, where we found that for learning rules of the usual type, and
under certain conditions, the evolution of macroscopic observables $\Omega[J] = (\Omega_1[J], \ldots, \Omega_k[J])$
is in the limit $N \to \infty$ described by deterministic laws. With the short-hand $\mathcal{F}[\ldots]$ for
$\mathcal{F}[|J|; J\cdot\xi, B\cdot\xi]$, and with the definition of sub-shell averages introduced in section four,

$$ \langle f(J) \rangle_{\Omega;t} = \frac{\int dJ\, p_t(J) f(J)\, \delta[\Omega - \Omega[J]]}{\int dJ\, p_t(J)\, \delta[\Omega - \Omega[J]]} \qquad (102) $$

these deterministic laws can be written as:



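$$ \text{On-Line:} \quad \frac{d}{dt}\Omega = \eta \Big\langle \Big\langle \sum_i \xi_i\, {\rm sgn}(B\cdot\xi)\, \mathcal{F}[\ldots] \frac{\partial \Omega[J]}{\partial J_i} \Big\rangle_D \Big\rangle_{\Omega;t} + \frac{\eta^2}{2N} \Big\langle \Big\langle \sum_{ij} \xi_i \xi_j\, \mathcal{F}^2[\ldots] \frac{\partial^2 \Omega[J]}{\partial J_i \partial J_j} \Big\rangle_D \Big\rangle_{\Omega;t} \qquad (103) $$

$$ \text{Batch:} \quad \frac{d}{dt}\Omega = \eta \Big\langle \sum_i \big\langle \xi_i\, {\rm sgn}(B\cdot\xi)\, \mathcal{F}[\ldots] \big\rangle_D \frac{\partial \Omega[J]}{\partial J_i} \Big\rangle_{\Omega;t} + \frac{\eta^2}{2N} \Big\langle \sum_{ij} \big\langle \xi_i\, {\rm sgn}(B\cdot\xi)\, \mathcal{F}[\ldots] \big\rangle_D \big\langle \xi_j\, {\rm sgn}(B\cdot\xi)\, \mathcal{F}[\ldots] \big\rangle_D \frac{\partial^2 \Omega[J]}{\partial J_i \partial J_j} \Big\rangle_{\Omega;t} \qquad (104) $$

Sufficient conditions for (103,104) to hold for $N \to \infty$ were found to be:

1. All $\Omega_\mu[J]$ are of order unity for $N \to \infty$

2. All $\Omega_\mu[J]$ are mean-field observables in the sense of (69)

3. $\Omega_1[J] = J^2$

4. For all $\mu$ and $\nu$:

$$ \lim_{N\to\infty} \frac{1}{N} \Big\langle \sum_{ij} \frac{\partial \Omega_\mu[J]}{\partial J_i}\, G_{ij}[J]\, \frac{\partial \Omega_\nu[J]}{\partial J_j} \Big\rangle_{\Omega;t} = 0 $$

in which the diffusion coefficients (for on-line and batch learning, respectively) are given by

$$ G^{\rm onl}_{ij}[J] = \eta^2 \big\langle \xi_i \xi_j\, \mathcal{F}^2[\ldots] \big\rangle_D, \qquad G^{\rm bat}_{ij}[J] = \eta^2 \big\langle \xi_i\, {\rm sgn}(B\cdot\xi)\, \mathcal{F}[\ldots] \big\rangle_D \big\langle \xi_j\, {\rm sgn}(B\cdot\xi)\, \mathcal{F}[\ldots] \big\rangle_D $$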

The basic idea of the formalism is to note that for those observables $\Omega[J]$ which obey closed
deterministic dynamical laws which are self-averaging in the limit $N \to \infty$, we can use
(103,104) to fully determine these laws. If $\Omega$ obeys closed equations we know that, at least
for $N \to \infty$, the right-hand sides of (103,104) by definition cannot depend on the distribution
of the microscopic probabilities $p_t(J)$ within the $\Omega$-sub-shells of (102). As a consequence we
can simplify the evaluation of (103,104) by making a convenient choice for $p_t(J)$: one that
describes probability equipartitioning within the $\Omega$-sub-shells, i.e.

$$ \langle f(J) \rangle_{\Omega;t} \to \langle f(J) \rangle_{\Omega} = \frac{\int dJ\, f(J)\, \delta[\Omega - \Omega[J]]}{\int dJ\, \delta[\Omega - \Omega[J]]} \qquad (105) $$
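
For the simplest case of just two observables the equipartitioned sub-shell average can be sampled directly. The sketch below assumes $\Omega = (Q, R)$ with $Q = J^2$ and $R = J\cdot B$ only (it ignores the distribution $P[x,y;J]$ used later), a normalised teacher vector $|B| = 1$ and $Q > R^2$; the function names are hypothetical. A uniform point on the sub-shell is $J = R B + \sqrt{Q - R^2}\, u$ with $u$ a uniformly distributed unit vector orthogonal to $B$.

```python
import math
import random


def sample_subshell(Q, R, B, rng):
    """Draw J uniformly from {J : J.J = Q, J.B = R}, assuming |B| = 1 and Q > R**2."""
    g = [rng.gauss(0.0, 1.0) for _ in B]
    overlap = sum(gi * bi for gi, bi in zip(g, B))
    u = [gi - overlap * bi for gi, bi in zip(g, B)]  # component orthogonal to B
    norm = math.sqrt(sum(ui * ui for ui in u))
    radius = math.sqrt(Q - R * R)
    return [R * bi + radius * ui / norm for bi, ui in zip(B, u)]


def subshell_average(f, Q, R, B, n_samples=2000, seed=0):
    """Monte Carlo estimate of the equipartitioned average <f(J)> over the (Q, R) sub-shell."""
    rng = random.Random(seed)
    return sum(f(sample_subshell(Q, R, B, rng)) for _ in range(n_samples)) / n_samples
```

Any function `f` of the weight vector can then be averaged, for instance `subshell_average(lambda J: J[0] ** 2, 2.0, 0.5, B)`.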



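Combination of (105) with (103,104), and usage of the self-averaging property, then leads to
the following closed and deterministic laws:

$$ \text{On-Line:} \quad \frac{d}{dt}\Omega = \eta \lim_{N\to\infty} \Big\langle \Big\langle \Big\langle \sum_i \xi_i\, {\rm sgn}(B\cdot\xi)\, \mathcal{F}[\ldots] \frac{\partial \Omega[J]}{\partial J_i} \Big\rangle_D \Big\rangle_{\Omega} \Big\rangle_{\rm sets} + \eta^2 \lim_{N\to\infty} \frac{1}{2N} \Big\langle \Big\langle \Big\langle \sum_{ij} \xi_i \xi_j\, \mathcal{F}^2[\ldots] \frac{\partial^2 \Omega[J]}{\partial J_i \partial J_j} \Big\rangle_D \Big\rangle_{\Omega} \Big\rangle_{\rm sets} \qquad (106) $$

$$ \text{Batch:} \quad \frac{d}{dt}\Omega = \eta \lim_{N\to\infty} \Big\langle \Big\langle \sum_i \big\langle \xi_i\, {\rm sgn}(B\cdot\xi)\, \mathcal{F}[\ldots] \big\rangle_D \frac{\partial \Omega[J]}{\partial J_i} \Big\rangle_{\Omega} \Big\rangle_{\rm sets} + \eta^2 \lim_{N\to\infty} \frac{1}{2N} \Big\langle \Big\langle \sum_{ij} \big\langle \xi_i\, {\rm sgn}(B\cdot\xi)\, \mathcal{F}[\ldots] \big\rangle_D \big\langle \xi_j\, {\rm sgn}(B\cdot\xi)\, \mathcal{F}[\ldots] \big\rangle_D \frac{\partial^2 \Omega[J]}{\partial J_i \partial J_j} \Big\rangle_{\Omega} \Big\rangle_{\rm sets} \qquad (107) $$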
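
The microscopic process that these macroscopic laws are meant to describe is easy to simulate for small systems. The sketch below is an illustration only, under several assumptions: binary input vectors, a randomly drawn normalised teacher $B$, a training set of $p = \alpha N$ examples, the update $J \to J + (\eta/N)\, \xi\, {\rm sgn}(B\cdot\xi)\, \mathcal{F}[|J|; J\cdot\xi, B\cdot\xi]$ with one randomly drawn example per step, and time measured as the number of steps divided by $N$. The default choice $\mathcal{F} = 1$ (a Hebbian-type rule) and all function names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def sgn(z):
    return 1.0 if z >= 0 else -1.0


def simulate_online(N=50, alpha=2.0, eta=1.0, t_max=20.0, F=None, seed=1):
    """On-line learning from a restricted training set; returns Q, R and both errors."""
    if F is None:
        F = lambda norm, x, y: 1.0  # Hebbian-type choice, used here only as an example
    rng = random.Random(seed)
    B = [rng.gauss(0.0, 1.0) for _ in range(N)]
    b_norm = math.sqrt(dot(B, B))
    B = [b / b_norm for b in B]  # teacher normalised to |B| = 1
    D = [[rng.choice((-1.0, 1.0)) for _ in range(N)] for _ in range(int(alpha * N))]
    J = [0.0] * N
    for _ in range(int(t_max * N)):  # N update steps per unit of time
        xi = rng.choice(D)
        x, y = dot(J, xi), dot(B, xi)
        scale = (eta / N) * sgn(y) * F(math.sqrt(dot(J, J)), x, y)
        J = [j + scale * s for j, s in zip(J, xi)]
    Q, R = dot(J, J), dot(J, B)
    # Training error: fraction of training examples on which student and teacher disagree.
    E_t = sum(1 for xi in D if dot(J, xi) * dot(B, xi) < 0) / len(D)
    # Generalisation error from the student-teacher angle.
    E_g = math.acos(max(-1.0, min(1.0, R / math.sqrt(Q)))) / math.pi
    return {"Q": Q, "R": R, "E_t": E_t, "E_g": E_g}
```

Comparing `E_t` and `E_g` for different values of `alpha` gives a direct impression of the gap between training and generalisation error that the theory aims to predict.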

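Given the choice for the observables $\Omega[J]$, our problem has now again been converted into a
technical one. One performs the average over all training sets using the replica identity

$$ \Big\langle \frac{\int dJ\, \Phi(J|D)\, W(J|D)}{\int dJ\, W(J|D)} \Big\rangle_{\rm sets} = \lim_{n \to 0} \int \prod_{\alpha=1}^{n} dJ^\alpha\, \Big\langle \Phi(J^1|D) \prod_{\alpha=1}^{n} W(J^\alpha|D) \Big\rangle_{\rm sets} $$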
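
The replica identity can be checked numerically on a toy system. In the sketch below, which is only an illustration, the integral over $J$ is replaced by a sum over a handful of discrete states, and each 'training set' is represented by a randomly drawn pair of lists `(phi, w)` with positive weights; all names are hypothetical. For integer $n$ the replicated sum factorises into $\sum_J \Phi W \cdot (\sum_J W)^{n-1}$, and continuing this expression to small real $n$ reproduces the disorder average of the ratio.

```python
import itertools
import random


def disorder_average_ratio(samples):
    """Direct evaluation of < sum_J Phi(J) W(J) / sum_J W(J) > over disorder samples."""
    total = 0.0
    for phi, w in samples:
        total += sum(p * x for p, x in zip(phi, w)) / sum(w)
    return total / len(samples)


def replica_expression(samples, n):
    """< sum_J Phi(J) W(J) * (sum_J W(J))**(n - 1) >, analytically continued to real n."""
    total = 0.0
    for phi, w in samples:
        total += sum(p * x for p, x in zip(phi, w)) * sum(w) ** (n - 1)
    return total / len(samples)


def replicated_sum(phi, w, n):
    """Explicit sum over n replicas of Phi(J^1) * prod_a W(J^a), for integer n >= 1."""
    total = 0.0
    for replicas in itertools.product(range(len(w)), repeat=n):
        weight = 1.0
        for a in replicas:
            weight *= w[a]
        total += phi[replicas[0]] * weight
    return total


def make_samples(n_samples=200, n_states=5, seed=0):
    """Random toy 'training sets': observable values phi and positive weights w."""
    rng = random.Random(seed)
    return [([rng.gauss(0.0, 1.0) for _ in range(n_states)],
             [rng.uniform(0.1, 1.0) for _ in range(n_states)])
            for _ in range(n_samples)]
```

The point of the identity is that the replicated expression contains no denominator, so that the average over training sets can be carried out before the integrals over the weights.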



The key question that remains is how to select the observables $\Omega[J]$: although the theory
is guaranteed to generate the exact dynamic equations for observables which indeed obey
closed, deterministic and self-averaging laws, it does not tell us beforehand which observables
will have these properties. If the chosen observables $\Omega[J]$ do not obey closed deterministic
laws, the method will generate an approximate theory in which one simply has made the
closure approximation that all microscopic states $J$ with identical values for the macroscopic
observables $\Omega[J]$ are equally probable. The available constraints to guide us
in finding the appropriate $\Omega[J]$ are the four properties listed below equation (104) and the
knowledge that we will need an infinite number of observables (i.e. $k \to \infty$), or, equivalently,
an observable function. In addition, for those systems where the equilibrium microscopic
probability density $p_\infty(J|D)$ is known and is of a Boltzmann form, i.e. $p_\infty(J|D) \sim e^{-\beta H(J|D)}$,
one can guarantee exactness of the theory in equilibrium by choosing one of the observables to be $H(J|D)/N$
(or equivalently a set of observables that determine $H(J|D)/N$ uniquely), since in that case
the equipartitioning assumption (105) is exact in equilibrium.



For learning dynamics of the type considered here, the results of section 4.3 automatically lead us
to the following choice:

$$ \Omega_1[J] = Q[J] = J^2, \qquad \Omega_2[J] = R[J] = J\cdot B, \qquad \Omega_{xy}[J] = P[x,y;J] = \big\langle \delta[x - J\cdot\xi]\, \delta[y - B\cdot\xi] \big\rangle_D \qquad (108) $$

(with $x, y \in \mathbb{R}$). Note that here we have defined the distribution $P[x,y;J]$ without explicit
normalisation of $J$, i.e. with $x = J\cdot\xi$ rather than $x = \hat{J}\cdot\xi$, which will make the subsequent
equations somewhat simpler. The procedure for dealing with the distribution $P[x,y;J]$ is to
first represent it by a finite number $\ell$ of values $P[x_\mu,y_\mu;J]$ (e.g. as a histogram), and take
the limit $\ell \to \infty$ after the limit $N \to \infty$ has been taken. It can be shown that the observables
(108) satisfy the four conditions for obeying deterministic laws in the $N \to \infty$ limit if all
$B_i = O(N^{-1/2})$ (demonstrating this is not entirely trivial in the case of the distribution
$P[x,y;J]$). Working out the closed equations (106,107) for the observables (108) gives the



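following result:

$$ \frac{d}{dt} Q = 2\eta \int dx\,dy\, P[x,y]\, x\, {\rm sgn}(y)\, \mathcal{F}[\sqrt{Q}; x, y] + \Delta \eta^2 \int dx\,dy\, P[x,y]\, \mathcal{F}^2[\sqrt{Q}; x, y] \qquad (109) $$

$$ \frac{d}{dt} R = \eta \int dx\,dy\, P[x,y]\, |y|\, \mathcal{F}[\sqrt{Q}; x, y] \qquad (110) $$

$$ \frac{d}{dt} P[x,y] = -\frac{\eta}{\alpha} \frac{\partial}{\partial x} \Big[ {\rm sgn}(y)\, \mathcal{F}[\sqrt{Q}; x, y]\, P[x,y] \Big] - \eta \frac{\partial}{\partial x} \int dx'dy'\, {\rm sgn}(y')\, \mathcal{F}[\sqrt{Q}; x', y']\, \mathcal{A}[x,y;x',y'] $$
$$ \qquad + \frac{1}{2} \Delta \eta^2 \frac{\partial^2}{\partial x^2} P[x,y] \int dx'dy'\, P[x',y']\, \mathcal{F}^2[\sqrt{Q}; x', y'] + \frac{\Delta \eta^2}{2\alpha} \frac{\partial^2}{\partial x^2} \Big[ \mathcal{F}^2[\sqrt{Q}; x, y]\, P[x,y] \Big] \qquad (111) $$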

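with $\Delta = 1$ for on-line learning and $\Delta = 0$ for batch learning. Again we observe that the
difference between the two modes of learning is reflected only in the presence/absence of the
$\eta^2$ terms in the dynamic laws. All complications are contained in the function $\mathcal{A}[x,y;x',y']$,
which plays the role of a Green's function, and is given by

$$ \mathcal{A}[x,y;x',y'] = \lim_{N\to\infty} \lim_{n\to 0} \int \prod_{\alpha=1}^{n} \Big\{ dJ^\alpha\, \delta[Q - Q[J^\alpha]]\, \delta[R - R[J^\alpha]] \prod_{\mu} \delta\big[ P[x_\mu,y_\mu] - P[x_\mu,y_\mu;J^\alpha] \big] \Big\} $$
$$ \qquad \times \Big\langle \Big\langle \Big\langle \delta[x - J^1\cdot\xi]\, \delta[y - B\cdot\xi]\, (\xi\cdot\xi')(1 - \delta_{\xi\xi'})\, \delta[x' - J^1\cdot\xi']\, \delta[y' - B\cdot\xi'] \Big\rangle_D \Big\rangle_D \Big\rangle_{\rm sets} \qquad (112) $$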
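
For a finite system the observables entering these laws can be measured directly from a weight vector and a training set. The following is a minimal sketch, assuming the definitions $Q = J^2$, $R = J\cdot B$ and the joint distribution of $(x, y) = (J\cdot\xi, B\cdot\xi)$ over the training set, represented as a histogram on a square grid; the grid parameters `bins` and `limit` and the function name are choices made here for illustration. Values outside the grid are placed in the edge bins.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def macroscopic_observables(J, B, D, bins=8, limit=4.0):
    """Return Q = J.J, R = J.B and a bins x bins histogram estimate of P[x, y; J]."""
    Q, R = dot(J, J), dot(J, B)
    width = 2.0 * limit / bins
    hist = [[0.0] * bins for _ in range(bins)]
    for xi in D:
        x, y = dot(J, xi), dot(B, xi)
        # Clamp to the grid so that outliers fall into the edge bins.
        ix = min(bins - 1, max(0, int((x + limit) // width)))
        iy = min(bins - 1, max(0, int((y + limit) // width)))
        hist[ix][iy] += 1.0 / (len(D) * width * width)  # density normalisation
    return Q, R, hist
```

The histogram is normalised as a density, so that summing all entries and multiplying by the bin area gives one.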

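After a number of manipulations we can perform the average over the training sets and write
(112) ultimately in the form

$$ \mathcal{A}[x,y;x',y'] = \int \frac{d\hat{x}\, d\hat{x}'\, d\hat{y}\, d\hat{y}'}{(2\pi)^4}\, e^{i[x\hat{x} + x'\hat{x}' + y\hat{y} + y'\hat{y}']} \lim_{n\to 0} \lim_{N\to\infty} \int dq\, d\hat{q}\, d\hat{Q}\, d\hat{R} \prod_{\alpha} \prod_{x''y''} d\hat{P}_\alpha(x'',y'')\, e^{N \Psi[q,\hat{q},\hat{Q},\hat{R},\{\hat{P}\}]}\, \mathcal{L}[\hat{x},\hat{y};\hat{x}',\hat{y}';q,\hat{q},\hat{Q},\hat{R},\{\hat{P}\}] $$

with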



^[•••] =i'^Qc{l-qa(x) +iR'^Ra + i'^qai3qai3 + i'^ / dx'dy" Pa{x" ,y")P[x" ,y"] 

a a aj3 " 



+ alogP[g,{P}] + lim - Vlog /rfcr e"*"'^^" 



(113) 



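The functions $\mathcal{L}[\ldots]$ and $\mathcal{D}[\ldots]$ are given by complicated integrals. The term in the expression
for $\mathcal{A}[\ldots]$ involving $\lim_{n\to 0}$ and $\lim_{N\to\infty}$ will be given by the intensive part $\mathcal{L}[\ldots]$ evaluated
in the dominating saddle-point of $\Psi$, and finally we get

$$ \mathcal{A}[x,y;x',y'] = \int \frac{d\hat{x}\, d\hat{x}'\, d\hat{y}\, d\hat{y}'}{(2\pi)^4}\, e^{i[x\hat{x} + x'\hat{x}' + y\hat{y} + y'\hat{y}']} \lim_{n\to 0} \mathcal{L}[\hat{x},\hat{y};\hat{x}',\hat{y}';q,\hat{q},\hat{Q},\hat{R},\{\hat{P}\}] \qquad (114) $$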



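in which the order parameters $\{q,\hat{q},\hat{Q},\hat{R},\{\hat{P}\}\}$ are calculated by extremisation of the
function $\Psi[\ldots]$ (113). The meaning of the order parameters $q_{\alpha\beta}$ in the relevant saddle point
at any time $t$ is given in terms of the (time-dependent) averaged probability distribution
$\langle P_t(q) \rangle_{\rm sets}$ for the mutual overlap between the weight vectors $J^a$ and $J^b$ of two independently
evolving learning processes with the same realisation of the training set $D$,

$$ \langle P_t(q) \rangle_{\rm sets} = \Big\langle \Big\langle \Big\langle \delta\Big[ q - \frac{J^a\cdot J^b}{|J^a||J^b|} \Big] \Big\rangle \Big\rangle \Big\rangle_{\rm sets} $$

One can show (for $N \to \infty$):

$$ \langle P_t(q) \rangle_{\rm sets} = \lim_{n\to 0} \frac{1}{n(n-1)} \sum_{\alpha \neq \beta} \delta[q - q_{\alpha\beta}] \qquad (115) $$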



At this stage one usually makes the so-called replica symmetric (RS) ansatz in the
extremisation problem, which in view of (115) is equivalent to assuming the absence of complex
ergodicity breaking (simple ergodicity breaking, i.e. with only a finite number of ergodic
components, is still possible via the existence of multiple solutions for the replica symmetric
saddle-point equations). This replica symmetric ansatz is usually correct in the transient
stages of the dynamics. If with a modest amount of foresight we put

$$ q_{\alpha\beta} = q_0 \delta_{\alpha\beta} + q[1 - \delta_{\alpha\beta}], \quad \hat{q}_{\alpha\beta} = \tfrac{1}{2} i [r - r_0 \delta_{\alpha\beta}], \quad \hat{R}_\alpha = i\rho, \quad \hat{Q}_\alpha = i\phi, \quad \hat{P}_\alpha(u,v) = i\chi[u,v] $$

we end up, after a modest amount of algebra and after elimination of most of the scalar order
parameters via the saddle-point equations, with an extremisation problem for a quantity $\Psi_{\rm RS}$
involving only the function $\chi$ and the scalar $q$:

^Rs[g,{x}] = ^2{l-^ ~ 2{l-q) + ^^°s(l-g) - J dx'dy' P{x,y)x{x,y) 

+ aj DyDz log J dx e-5^'/Q(i-?)+^[^2^^^1+^x(^,?/) (ng) 
with the short-hands $A = R/(1-q)Q$ and $B = \sqrt{qQ - R^2}/(1-q)Q$ and the short-hand for the
Gaussian measure $Dz = (2\pi)^{-1/2} e^{-\frac{1}{2} z^2} dz$ (similarly for $Dy$). This result is surprisingly simple,
compared to similar results for other complex systems of this class (such as spin-glasses and
attractor neural networks near saturation). Firstly, it involves just a small number of order
parameters to be varied (just $q$ and the function $\chi$). Secondly, if one works out the
saddle-point equations one recovers from the formalism convenient relations such as
$\int dx\, P(x,y) = (2\pi)^{-1/2} e^{-\frac{1}{2} y^2}$ for all $y$ (this makes sense: the distribution of $y = B\cdot\xi$ is
Gaussian since the components $B_i$ are statistically independent of the vectors in the training sets).

The final solution provided by dynamical replica theory thus consists of the equations
(109,110,111), which are to be solved numerically, in which at each infinitesimal time-step
one has to solve the saddle-point problem for (116). The training and generalisation errors
are then at any time simply given by:

$$ \langle\langle E_t \rangle\rangle_{\rm sets} = \int dx\,dy\, \theta[-xy]\, P[x,y], \qquad \langle\langle E_g \rangle\rangle_{\rm sets} = \frac{1}{\pi} \arccos[R/\sqrt{Q}] $$
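
To illustrate how the scalar laws for $Q$ and $R$ are integrated in practice, the sketch below treats a deliberately simplified special case and is not the full theory: it assumes that $P[x,y]$ is Gaussian with $\langle x^2\rangle = Q$, $\langle xy\rangle = R$, $\langle y^2\rangle = 1$ (as one would expect for a complete training set), and it takes $\mathcal{F} = 1$. Under these assumptions $\int dx\,dy\, P\, x\,{\rm sgn}(y) = R\sqrt{2/\pi}$ and $\int dx\,dy\, P\, |y| = \sqrt{2/\pi}$, so that the equations for $Q$ and $R$ close without the distribution equation or the saddle-point problem. Function names and default parameters are hypothetical.

```python
import math


def integrate_hebbian(eta=1.0, Delta=1.0, Q0=1.0, R0=0.0, t_max=10.0, dt=1e-3):
    """Euler integration of dQ/dt = 2*eta*c*R + Delta*eta**2, dR/dt = eta*c, c = sqrt(2/pi)."""
    c = math.sqrt(2.0 / math.pi)
    Q, R = Q0, R0
    for _ in range(int(round(t_max / dt))):
        dQ = 2.0 * eta * c * R + Delta * eta ** 2
        dR = eta * c
        Q, R = Q + dt * dQ, R + dt * dR
    return Q, R


def closed_form(eta, Delta, Q0, R0, t):
    """Exact solution of the same two equations, for comparison with the Euler scheme."""
    c = math.sqrt(2.0 / math.pi)
    R = R0 + eta * c * t
    Q = Q0 + (2.0 * eta * c * R0 + Delta * eta ** 2) * t + (eta * c * t) ** 2
    return Q, R


def generalisation_error(Q, R):
    """E_g = arccos(R / sqrt(Q)) / pi, with the argument clamped against rounding."""
    return math.acos(max(-1.0, min(1.0, R / math.sqrt(Q)))) / math.pi
```

In the general case with a restricted training set the same time-stepping loop would additionally have to update $P[x,y]$ and solve the saddle-point problem at every step, which is where the computational cost discussed below arises.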

The need for solving a complicated saddle-point problem at each infinitesimal time-step
explains why working out the predictions of the theory for very large times requires a
prohibitively large amount of CPU time. However, the simple form of the present saddle-point
equations hints at the possibility to introduce a more basic distribution $P(x|y,z)$ for which
the saddle-point problem is sufficiently trivial to allow for analytical solution, and which thus
obeys a diffusion-type equation given in explicit form. The intuition developed in using this
formalism for other systems with a comparably complex dynamics suggests that the equations
resulting from this formalism will be either exact or a reliable approximation, especially in
the transient stages of the learning process.

6 Bibliographical Notes



The application of statistical mechanical tools to learning processes in artificial neural net- 
works was mainly initiated by the hugely influential study [Q]. It is impossible to list even 
a fraction of the papers that followed. Those interested in early applications of statistical 
mechanics to neural network learning can find their way into the literature via the dedicated 
(memorial) issue Q of Journal of Physics A (mostly on statics) and the early review paper 
Q . The approaches and styles of many subsequent statistical mechanical studies of learning 
dynamics were generated by the two influential papers and 0]. More recent reviews of 
the general area of the statistical mechanics of learning and generalisation (including both 
statics and dynamics) are |^ |9| . 

The on-line learning algorithms studied in section two were first introduced/studied in 
lllOll (perceptron rule) , [0] (Hebbian rule) and |]l^] ( AdaTron rule) , although at the time these 
algorithms were not yet studied with the methods described here. The convenient expression 
for the generalization error of binary perceptrons in terms of the inner product of the student
and teacher weight vectors, $E_g = \frac{1}{\pi} \arccos(J\cdot B/|J|)$, appeared first in [13, 14, 15]. In the latter,
[15], one first finds in an embryonic form the set-up of deriving closed macroscopic equations
for the observables $J^2$ and $J\cdot B$. Many of the results we described on on-line learning with
complete training sets in perceptrons with fixed rules and fixed learning rates can be found in 
1 16, 17, |l^, The general lower bound on the generalization error that can be achieved for 



a given number of question/answer pairs, translating into the lower bound $E_g \sim 0.44\ldots\, t^{-1}$
for on-line learning rules, was derived in [^]. Calculations involving on-line rules with time- 
dependent learning rates can be found in |21, |l^. The systematic optimisation of learning 
rules to achieve the fastest decay of the generalization error in perceptrons was introduced 



already in |1£]. 



In one first finds the method to derive exact stochastic differential equations describing 
learning dynamics (an application of [§^]), followed by several studies aimed at extracting 
information from the microscopic dynamics directly]^ The differences between batch learning 
and on-line learning appear so far to have been addressed mainly in equilibrium calculations 



1 24, There is not yet much literature on learning dynamics with incomplete training 

sets, apart from simple cases and linear models such as 0. The generating function approach 
to the learning dynamics for incomplete training sets was elaborated for perceptrons with 



binary weights in [27|. The version of the dynamical replica theory calculations described in 



section five was developed in |28|, and is only now being applied to learning dynamics |29] 



Finally, even within the already confined area of statistical mechanical studies of the 
dynamics of learning we have specialised to the simplest models (binary perceptrons) and the 
simplest types of tasks (those generated by a noise-free and realisable teacher). As a result
there are many interesting areas which we had to leave out, such as e.g. the dynamics of 
learning in the presence of noise, for unsupervised learning rules or for non-stationary teachers 
|17, |3^. The most important areas we were forced to leave out, however, are the large bodies 
of work done on different classes of learning rules, e.g. those involving continuous rather than 



^^This line of research, termed 'stochastic approximation theory', is sometimes presented as opposite to the 
approach based on deriving macroscopic equations, with only little scientific justification. The two approaches 
are mutually consistent and complementary; they simply concentrate on different levels of description and 
are (sometimes) worked out in different limits. In the present paper we used both, and switched from one to 
another whenever necessary. 
binary neurons or those in the form of microscopic Fokker-Planck equations, as well as (and
especially) on various families of multilayer networks (mostly so-called committee machines,
in which the weights connecting the hidden layer to the output neuron(s) are fixed). Relevant
recent papers in these areas are e.g. 

The techniques described in this review can be applied with only minor adjustments and
extensions to layered networks of graded response neurons, provided the number of neurons
in the hidden layer(s) remains finite in the limit $N \to \infty$. As soon as we move to layered
networks in which the number of hidden neurons scales proportionally to $N$, on the other hand,
we again face the problem of macroscopic dynamic equations which fail to close. Solving this
problem, and the one of handling incomplete training sets, are the key objectives of most
present-day research efforts in the statistical mechanics of learning.
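A Appendix: Integrals

In this appendix we give brief derivations of those integrals encountered throughout this
paper that turn out to be easy, and give the appropriate reference for finding the nasty ones.
All involve the following Gaussian distribution:

$$ P(x,y) = \frac{1}{2\pi\sqrt{1-\omega^2}}\, e^{-\frac{1}{2}[x^2 + y^2 - 2\omega xy]/(1-\omega^2)}, \qquad \langle f(x,y) \rangle = \int dx\,dy\, f(x,y)\, P(x,y) $$

I: $I_1 = \langle |y| \rangle$

$$ I_1 = \int \frac{dy}{\sqrt{2\pi}}\, |y|\, e^{-\frac{1}{2} y^2} = 2 \int_0^\infty \frac{dy}{\sqrt{2\pi}}\, y\, e^{-\frac{1}{2} y^2} = \sqrt{\frac{2}{\pi}} $$

II: $I_2 = \langle x\, {\rm sgn}(y) \rangle$

$$ I_2 = \int \frac{dx\,dy}{2\pi\sqrt{1-\omega^2}}\, {\rm sgn}(y) \Big[ \omega y - (1-\omega^2) \frac{\partial}{\partial x} \Big] e^{-\frac{1}{2}[x^2 + y^2 - 2\omega xy]/(1-\omega^2)} = \omega \langle |y| \rangle = \omega \sqrt{\frac{2}{\pi}} $$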



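III: $I_3 = \langle \theta[-xy] \rangle$

$$ I_3 = \int_0^\infty\!\!\int_0^\infty \frac{dx\,dy}{\pi\sqrt{1-\omega^2}}\, e^{-\frac{1}{2}[x^2 + y^2 + 2\omega xy]/(1-\omega^2)} = \frac{\sqrt{1-\omega^2}}{\pi} \int_0^\infty\!\!\int_0^\infty dx\,dy\, e^{-\frac{1}{2}[x^2 + y^2 + 2\omega xy]} $$

Introduce polar coordinates $(x,y) = r(\cos\phi, \sin\phi)$:

$$ I_3 = \frac{\sqrt{1-\omega^2}}{\pi} \int_0^{\pi/2} d\phi \int_0^\infty dr\, r\, e^{-\frac{1}{2} r^2 [1 + \omega \sin(2\phi)]} = \frac{\sqrt{1-\omega^2}}{2\pi} \int_0^{\pi} \frac{d\phi}{1 + \omega \sin(\phi)} = \frac{1}{\pi} \Big[ \frac{\pi}{2} - \arctan\Big( \frac{\omega}{\sqrt{1-\omega^2}} \Big) \Big] $$

(the last integral can be found in [||]). Finally, using $\cos[\frac{\pi}{2} - \psi] = \sin\psi$, we find

$$ I_3 = \frac{1}{\pi} \arccos(\omega) $$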



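IV: $I_4 = \langle x\, {\rm sgn}(y)\, \theta[-xy] \rangle$

$$ I_4 = -\frac{1}{\pi\sqrt{1-\omega^2}} \int_0^\infty\!\!\int_0^\infty dx\,dy\, x\, e^{-\frac{1}{2}[x^2 + y^2 + 2\omega xy]/(1-\omega^2)} = -\frac{1}{\pi} \int_0^\infty dx\, x\, e^{-\frac{1}{2} x^2} \int_{\omega x/\sqrt{1-\omega^2}}^\infty dy\, e^{-\frac{1}{2} y^2} $$

$$ \quad = \frac{1}{\pi} \int_0^\infty dx \Big[ \frac{\partial}{\partial x} e^{-\frac{1}{2} x^2} \Big] \int_{\omega x/\sqrt{1-\omega^2}}^\infty dy\, e^{-\frac{1}{2} y^2} = -\frac{1}{\pi} \sqrt{\frac{\pi}{2}} + \frac{\omega}{\pi\sqrt{1-\omega^2}} \int_0^\infty dx\, e^{-\frac{1}{2} x^2/(1-\omega^2)} = -\frac{1-\omega}{\sqrt{2\pi}} $$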
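V: $I_5 = \langle |y|\, \theta[-xy] \rangle$

$$ I_5 = \int_0^\infty\!\!\int_0^\infty dx\,dy\, y\, [P(x,-y) + P(-x,y)] = -I_4 = \frac{1-\omega}{\sqrt{2\pi}} $$

(the second step uses the symmetry of $P(x,y)$ under $x \leftrightarrow y$).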

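VI: $I_6 = \langle x^2\, \theta[-xy] \rangle$

$$ I_6 = \frac{1}{\pi\sqrt{1-\omega^2}} \int_0^\infty\!\!\int_0^\infty dx\,dy\, x^2\, e^{-\frac{1}{2}[x^2 + y^2 + 2\omega xy]/(1-\omega^2)} = \frac{1}{2\pi\sqrt{1-\omega^2}} \int_0^\infty\!\!\int_0^\infty dx\,dy\, (x^2 + y^2)\, e^{-\frac{1}{2}[x^2 + y^2 + 2\omega xy]/(1-\omega^2)} $$

We switch to polar coordinates $(x,y) = r(\cos\theta, \sin\theta)$, and subsequently substitute
$t = r^2[1 + \omega\sin(2\theta)]/[1 - \omega^2]$:

$$ I_6 = \frac{1}{2\pi\sqrt{1-\omega^2}} \int_0^{\pi/2} d\theta \int_0^\infty dr\, r^3\, e^{-\frac{1}{2}[r^2 + 2\omega r^2 \cos\theta \sin\theta]/(1-\omega^2)} = \frac{(1-\omega^2)^{3/2}}{4\pi} \int_0^{\pi/2} \frac{d\theta}{(1 + \omega\sin(2\theta))^2} \int_0^\infty dt\, t\, e^{-\frac{1}{2} t} $$

$$ \quad = \frac{(1-\omega^2)^{3/2}}{2\pi} \int_0^{\pi} \frac{d\phi}{(1 + \omega\sin\phi)^2} $$

To calculate the latter integral we define

$$ \tilde{I}_n = \int_0^{\pi} \frac{d\phi}{(1 + \omega\sin\phi)^n} $$

These integrals obey

$$ \omega \frac{d}{d\omega} \tilde{I}_n - n \tilde{I}_{n+1} = -n \tilde{I}_n $$

so

$$ \tilde{I}_2 = \tilde{I}_1 + \omega \frac{d}{d\omega} \tilde{I}_1, \qquad \tilde{I}_1 = \frac{2}{\sqrt{1-\omega^2}} \arccos(\omega) $$

(where we used the integral already encountered in III). We now find

$$ I_6 = \frac{(1-\omega^2)^{3/2}}{2\pi} \tilde{I}_2 = \frac{1-\omega^2}{\pi} \arccos(\omega) - \frac{\omega\sqrt{1-\omega^2}}{\pi} + \frac{\omega^2}{\pi} \arccos(\omega) = \frac{1}{\pi} \arccos(\omega) - \frac{\omega}{\pi} \sqrt{1-\omega^2} $$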

27r TT TT TT 

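VII: $I_7 = \langle |x||y|\, \theta[-xy] \rangle$

$$ I_7 = \int_0^\infty\!\!\int_0^\infty \frac{dx\,dy}{\pi\sqrt{1-\omega^2}}\, xy\, e^{-\frac{1}{2}[x^2 + y^2 + 2\omega xy]/(1-\omega^2)} $$

We use the relation

$$ x\, e^{-\frac{1}{2}[x^2 + y^2 + 2\omega xy]/(1-\omega^2)} = -\Big[ (1-\omega^2) \frac{\partial}{\partial x} + \omega y \Big] e^{-\frac{1}{2}[x^2 + y^2 + 2\omega xy]/(1-\omega^2)} $$

to give us, using VI:

$$ I_7 = \frac{\sqrt{1-\omega^2}}{\pi} \int_0^\infty dy\, y\, e^{-\frac{1}{2} y^2/(1-\omega^2)} - \omega \langle y^2\, \theta[-xy] \rangle = \frac{(1-\omega^2)^{3/2}}{\pi} - \frac{\omega}{\pi} \arccos(\omega) + \frac{\omega^2}{\pi} \sqrt{1-\omega^2} $$

$$ \quad = \frac{\sqrt{1-\omega^2}}{\pi} - \frac{\omega}{\pi} \arccos(\omega) $$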
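
The averages I to VII are easy to cross-check by sampling. The sketch below is offered as a sanity check rather than as part of the derivations: it assumes the standard closed-form values for these correlated-Gaussian averages (listed in `closed_forms`, a name introduced here), draws pairs $(x,y)$ with unit variances and correlation $\omega$ via $x = \omega y + \sqrt{1-\omega^2}\, z$, and compares Monte Carlo estimates with the formulas.

```python
import math
import random


def closed_forms(w):
    """Closed-form values of the averages I1..I7 for correlation w (assumed here)."""
    s = math.sqrt(1.0 - w * w)
    a = math.acos(w)
    c = math.sqrt(2.0 * math.pi)
    return {
        "I1": math.sqrt(2.0 / math.pi),      # <|y|>
        "I2": w * math.sqrt(2.0 / math.pi),  # <x sgn(y)>
        "I3": a / math.pi,                   # <theta[-xy]>
        "I4": -(1.0 - w) / c,                # <x sgn(y) theta[-xy]>
        "I5": (1.0 - w) / c,                 # <|y| theta[-xy]>
        "I6": (a - w * s) / math.pi,         # <x^2 theta[-xy]>
        "I7": (s - w * a) / math.pi,         # <|x||y| theta[-xy]>
    }


def monte_carlo(w, n_samples=200000, seed=0):
    """Monte Carlo estimates of the same seven averages."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - w * w)
    acc = dict.fromkeys(("I1", "I2", "I3", "I4", "I5", "I6", "I7"), 0.0)
    for _ in range(n_samples):
        y = rng.gauss(0.0, 1.0)
        x = w * y + s * rng.gauss(0.0, 1.0)  # unit variance, correlation w with y
        sign_y = 1.0 if y >= 0 else -1.0
        wrong = 1.0 if x * y < 0 else 0.0    # theta[-xy]
        acc["I1"] += abs(y)
        acc["I2"] += x * sign_y
        acc["I3"] += wrong
        acc["I4"] += x * sign_y * wrong
        acc["I5"] += abs(y) * wrong
        acc["I6"] += x * x * wrong
        acc["I7"] += abs(x) * abs(y) * wrong
    return {k: v / n_samples for k, v in acc.items()}
```

With a few hundred thousand samples the statistical error of each estimate is of the order of a few times $10^{-3}$, which is enough to catch sign or prefactor mistakes.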
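VIII: $I_8(x) = \int dy\, \theta[y]\, P(x,y)$

$$ I_8(x) = \int_0^\infty \frac{dy}{2\pi\sqrt{1-\omega^2}}\, e^{-\frac{1}{2}[x^2 + y^2 - 2\omega xy]/(1-\omega^2)} = \frac{e^{-\frac{1}{2} x^2}}{2\pi\sqrt{1-\omega^2}} \int_0^\infty dy\, e^{-\frac{1}{2}(y - \omega x)^2/(1-\omega^2)} = \frac{e^{-\frac{1}{2} x^2}}{2\sqrt{2\pi}} \Big[ 1 + {\rm erf}\Big( \frac{\omega x}{\sqrt{2}\sqrt{1-\omega^2}} \Big) \Big] $$

IX: $I_9(x) = \int dy\, \theta[y]\, (y - \omega x)\, P(x,y)$

$$ I_9(x) = -\frac{\sqrt{1-\omega^2}}{2\pi} \int_0^\infty dy\, \frac{\partial}{\partial y}\, e^{-\frac{1}{2}[x^2 + y^2 - 2\omega xy]/(1-\omega^2)} = \frac{\sqrt{1-\omega^2}}{2\pi}\, e^{-\frac{1}{2} x^2/(1-\omega^2)} $$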
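
The two $x$-dependent integrals VIII and IX can be checked by straightforward numerical quadrature. The sketch below assumes the Gaussian $P(x,y)$ with unit variances and correlation $\omega$, truncates the $y$-integral at a finite upper limit (a choice made here, adequate for moderate $|x|$), and compares a trapezoidal estimate with the closed forms; the function names are hypothetical.

```python
import math


def P(x, y, w):
    """Correlated Gaussian density with unit variances and correlation w."""
    norm = 2.0 * math.pi * math.sqrt(1.0 - w * w)
    return math.exp(-0.5 * (x * x + y * y - 2.0 * w * x * y) / (1.0 - w * w)) / norm


def trapezoid(f, a, b, n):
    """Composite trapezoidal rule for f on [a, b] with n sub-intervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        total += f(a + k * h)
    return total * h


def I8_numeric(x, w, y_max=12.0, n=20000):
    return trapezoid(lambda y: P(x, y, w), 0.0, y_max, n)


def I9_numeric(x, w, y_max=12.0, n=20000):
    return trapezoid(lambda y: (y - w * x) * P(x, y, w), 0.0, y_max, n)


def I8_closed(x, w):
    arg = w * x / (math.sqrt(2.0) * math.sqrt(1.0 - w * w))
    return math.exp(-0.5 * x * x) / (2.0 * math.sqrt(2.0 * math.pi)) * (1.0 + math.erf(arg))


def I9_closed(x, w):
    return math.sqrt(1.0 - w * w) / (2.0 * math.pi) * math.exp(-0.5 * x * x / (1.0 - w * w))
```

For example, `I8_numeric(0.7, 0.6)` and `I8_closed(0.7, 0.6)` should agree to many digits.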